Methods, systems, computer readable media, and kits for sample identification

ABSTRACT

A method for sequencing a polynucleotide sample having a barcode sequence, includes: introducing a series of nucleotides to the polynucleotide sample according to a predetermined flow ordering; obtaining a series of signals resulting from the introducing of nucleotides to the polynucleotide sample; and resolving the series of signals over the barcode sequence to render a flowspace string, wherein the flowspace string is a codeword of an error-tolerant code capable of distinguishing the barcode sequence from other barcode sequences in the presence of one or more errors.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/599,876, filed Aug. 30, 2012, which claims the benefit of U.S. Prov. Pat. Appl. No. 61/545,290, filed Oct. 10, 2011 (now expired), and U.S. Prov. Pat. Appl. No. 61/529,687, filed Aug. 31, 2011 (now expired), which are all incorporated by reference herein in their entireties.

This application contains a Sequence Listing, which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy is named LT00574_ST25.txt, was created on Nov. 5, 2012, and is 4,144 bytes in size.

FIELD

This application generally relates to methods, systems, computer readable media, and kits for sample identification, and, more specifically, to methods, systems, computer readable media, and kits for designing and/or making and/or using sample discriminating codes or barcodes for identifying sample nucleic acids or other biomolecules or polymers.

BACKGROUND

Various instruments, apparatuses, and/or systems perform sequencing of nucleic acid sequences using sequencing-by-synthesis, including, for example, the Genome Analyzer/HiSeq/MiSeq platforms (Illumina, Inc.; see, e.g., U.S. Pat. Nos. 6,833,246 and 5,750,341); the GS FLX, GS FLX Titanium, and GS Junior platforms (Roche/454 Life Sciences; see, e.g., Ronaghi et al., SCIENCE, 281:363-365 (1998), and Margulies et al., NATURE, 437:376-380 (2005)); and the Ion PGM™ and Ion Proton™ Sequencers (Life Technologies Corp./Ion Torrent; see, e.g., U.S. Pat. No. 7,948,015 and U.S. Pat. Appl. Publ. Nos. 2010/0137143, 2009/0026082, and 2010/0282617, which are all incorporated by reference herein in their entirety). In order to increase sequencing throughput and/or lower costs for sequencing-by-synthesis (and other sequencing methods such as, e.g., sequencing-by-hybridization, sequencing-by-ligation, etc.), there is a need for new methods, systems, computer readable media, and kits that allow highly efficient preparation and/or identification of samples of potentially high complexity.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more exemplary embodiments and serve to explain the principles of various exemplary embodiments. The drawings are exemplary and explanatory only and are not to be construed as limiting or restrictive in any way.

FIG. 1A illustrates an exemplary system for obtaining, processing, and/or analyzing multiplex nucleic acid sequencing data.

FIG. 1B illustrates an exemplary method for obtaining, processing, and/or analyzing multiplex nucleic acid sequencing data.

FIG. 2A illustrates components of an exemplary system for nucleic acid sequencing.

FIG. 2B illustrates an exemplary system for nucleic acid sequencing.

FIG. 3A illustrates cross-sectional and expanded views of an exemplary flow cell for nucleic acid sequencing.

FIG. 3B illustrates an exemplary uniform flow front between successive reagents moving across a section of an exemplary microwell array.

FIG. 4 illustrates an exemplary process for label-free, pH-based sequencing.

FIG. 5 shows an exemplary ionogram representation of signals from which base calls may be made.

FIGS. 6A and 6B demonstrate a relationship between a base space sequence and a flowspace vector.

FIG. 7A illustrates an exemplary translation to flowspace of 36 9-mer ternary Golay codewords permissible under a predetermined flow ordering.

FIG. 7B illustrates an exemplary translation to flowspace of 332 ternary Golay codewords permissible under the same predetermined flow ordering as in FIG. 7A.

FIG. 7C illustrates exemplary Golay flowspace barcode weights for the codewords represented in FIG. 7A.

FIG. 8A illustrates 8 nonzero ternary tetracode codewords.

FIG. 8B illustrates 288 nonzero ternary concatenated tetracode codewords permissible under the same predetermined flow ordering as in FIG. 7A.

FIG. 9 illustrates a pool of seven different polynucleotide strands, each with a unique barcode sequence.

FIGS. 10A-10C illustrate an exemplary workflow for preparing a multiplex sample.

FIG. 11 illustrates an exemplary beaded template.

FIG. 12 illustrates another exemplary beaded template.

FIG. 13 illustrates an exemplary mate pair beaded template.

FIG. 14A illustrates an exemplary barcoded adaptor.

FIG. 14B illustrates an exemplary beaded template.

FIG. 15 illustrates another exemplary beaded template.

EXEMPLARY EMBODIMENTS

The following description and the various embodiments described herein are exemplary and explanatory only and are not to be construed as limiting or restrictive in any way. Other embodiments, features, objects, and advantages of the present teachings will be apparent from the description and accompanying drawings, and from the claims.

In accordance with the teachings and principles embodied in this application, new methods, systems, computer readable media, and kits that allow highly efficient preparation and/or identification of samples of potentially high complexity are provided. The new methods, systems, computer readable media, and kits may help increase throughput by allowing contemporaneous sequencing and/or analysis of multiple samples (e.g., multiplexed sequencing), facilitated by using sample discriminating codes or coded molecular constructs. Multiplexed sequencing may allow multiple coded samples (for example, different samples or samples from different sources) to be analyzed substantially simultaneously in a single sequencing run (e.g., on a common slide, chip, substrate, or other sample holder device).

In accordance with the teachings and principles embodied in this application, new methods, systems, computer readable media, and kits allowing identification of an origin of samples used in multiplexed sequencing are provided. Such identification may involve an analysis of sequencing data for the samples. The source of the sequencing data may be uniquely tagged, coded, or identified (e.g., to resolve a particular nucleic acid species associated with a particular sample population). Such identification may be facilitated by using unique sample discriminating codes or sequences (also known as barcodes, e.g., nucleic acid barcodes) that may be embedded within or otherwise associated with the samples. Such discriminators are not, however, immune to errors or misreads that may occur during sequencing. For example, an erroneous barcode read may alter interpretation of the barcode information, which may render the barcode unrecognizable and prevent correct sample identification. An erroneous barcode read may also result in the association of a sample with an incorrect sample source or population of origin, which may be particularly problematic for clinical samples.

In accordance with the teachings and principles embodied in this application, new methods, systems, computer readable media, and kits that may help address the problem of detecting and/or correcting errors that can arise during the sequencing of samples comprising barcodes are provided. In various embodiments, sample discriminating codes or sequences or barcodes and methodologies for developing robust sample discriminating codes or sequences or barcodes that incorporate an error-tolerant code (e.g., an error-correcting code or an error-detecting code) are provided.
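As a minimal illustration of what an error-correcting code offers here, the following Python sketch enumerates a small ternary code and decodes a corrupted word to its nearest codeword. The generator matrix shown is one standard form of the ternary tetracode and is an assumption made for illustration, as are the function names; nothing in this sketch is taken from the figures or claims.

```python
from itertools import product

# Assumed generator matrix for a ternary tetracode (length 4, dimension 2).
# Illustrative only; not taken from the specification or its figures.
GENERATOR = [(1, 0, 1, 1), (0, 1, 1, 2)]


def tetracode_codewords():
    """Enumerate all 9 codewords a*g0 + b*g1 with arithmetic modulo 3."""
    words = []
    for a, b in product(range(3), repeat=2):
        word = tuple((a * g0 + b * g1) % 3 for g0, g1 in zip(*GENERATOR))
        words.append(word)
    return words


def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(u, v))


def min_distance(words):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming(u, v) for i, u in enumerate(words) for v in words[i + 1:])


def decode(received, words):
    """Return the unique nearest codeword, or None if the nearest is tied."""
    ranked = sorted(words, key=lambda w: hamming(received, w))
    if hamming(received, ranked[0]) == hamming(received, ranked[1]):
        return None
    return ranked[0]


if __name__ == "__main__":
    codewords = tetracode_codewords()
    print(len(codewords) - 1, "nonzero codewords")
    print("minimum distance:", min_distance(codewords))
    print(decode((1, 0, 2, 1), codewords))  # one symbol altered from (1, 0, 1, 1)
```

Because the minimum distance of this code is 3, any single-symbol error leaves the received word closer to the original codeword than to any other, which is the property that lets a barcode be recovered despite a misread.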

Unless otherwise specifically designated herein, terms, techniques, and symbols of biochemistry, cell biology, cell and tissue culture, genetics, molecular biology, nucleic acid chemistry, and organic chemistry (including chemical and physical analysis of polymer particles, enzymatic reactions and purification, nucleic acid purification and preparation, nucleic acid sequencing and analysis, polymerization techniques, preparation of synthetic polynucleotides, recombinant techniques, etc.) used herein follow those of standard treatises and texts in the relevant field. See, e.g., Kornberg and Baker, DNA REPLICATION, 2nd ed. (W.H. Freeman, New York, 1992); Lehninger, BIOCHEMISTRY, 2nd ed. (Worth Publishers, New York, 1975); Strachan and Read, HUMAN MOLECULAR GENETICS, 2nd ed. (Wiley-Liss, New York, 1999); Birren et al. (eds.), GENOME ANALYSIS: A LABORATORY MANUAL SERIES (Vols. I-IV), Dieffenbach and Dveksler (eds.), PCR PRIMER: A LABORATORY MANUAL, and Green and Sambrook (eds.), MOLECULAR CLONING: A LABORATORY MANUAL (all from Cold Spring Harbor Laboratory Press); and Hermanson, BIOCONJUGATE TECHNIQUES, 2nd ed. (Academic Press, 2008).

As used herein, “amplifying” generally refers to performing an amplification reaction. As used herein, “amplicon” generally refers to a product of a polynucleotide amplification reaction, which includes a clonal population of polynucleotides, which may be single stranded or double stranded and which may be replicated from one or more starting sequences. The one or more starting sequences may be one or more copies of the same sequence, or they may be a mixture of different sequences that contain a common region that is amplified such as, for example, a specific exon sequence present in a mixture of DNA fragments extracted from a sample. Preferably, amplicons may be formed by the amplification of a single starting sequence. Amplicons may be produced by a variety of amplification reactions whose products comprise replicates of one or more starting, or target, nucleic acids. Amplification reactions producing amplicons may be “template-driven” in that base pairing of reactants, either nucleotides or oligonucleotides, with complements in a template polynucleotide is required for the creation of reaction products. Template-driven reactions may be primer extensions with a nucleic acid polymerase or oligonucleotide ligations with a nucleic acid ligase. Such reactions include, for example, polymerase chain reactions (PCRs), linear polymerase reactions, nucleic acid sequence-based amplifications (NASBAs), and rolling circle amplifications, including the use of rolling circle amplification to form a single body that may exclusively occupy a microwell as disclosed in Drmanac et al., U.S. Pat. Appl. Publ. No. 2009/0137404, which is incorporated by reference herein in its entirety. As used herein, “solid phase amplicon” generally refers to a solid phase support, such as a particle or bead, to which is attached a clonal population of nucleic acid sequences, which may have been produced by an emulsion PCR, for example.

As used herein, “analyte” generally refers to a molecule or biological cell that can directly affect an electronic sensor in a region (such as a defined space or reaction confinement region or microwell, for example) or that can indirectly affect such an electronic sensor by a by-product from a reaction involving such molecule or biological cell located in such region. In an embodiment, an analyte may be a sample or template nucleic acid, which may be subjected to a sequencing reaction, which may, in turn, generate a reaction by-product, such as one or more hydrogen ions, that can affect an electronic sensor. The term “analyte” also comprehends multiple copies of analytes, such as proteins, peptides, or nucleic acids, for example, attached to solid supports, such as beads or particles, for example. In an embodiment, an analyte may be a nucleic acid amplicon or a solid phase amplicon. A sample nucleic acid template may be associated with a surface via covalent bonding or a specific binding or coupling reaction, and may be derived from, for example, a shot-gun fragmented DNA or amplicon library (which are examples of library fragments further discussed herein), or a sample emulsion PCR process creating clonally-amplified sample nucleic acid templates on particles such as IonSphere™ particles. An analyte may include particles having attached thereto clonal populations of DNA fragments, e.g., genomic DNA fragments or cDNA fragments.

As used herein, “primer” generally refers to an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex may be formed. Extension of a primer may be carried out with a nucleic acid polymerase, such as a DNA or RNA polymerase. The sequence of nucleotides added in the extension process may be determined by the sequence of the template polynucleotide. Primers may have a length in the range of from 14 to 40 nucleotides, or in the range of from 18 to 36 nucleotides, for example, or from N to M nucleotides where N is an integer larger than 18 and M is an integer larger than N and smaller than 36, for example. Other lengths are of course possible. Primers may be employed in a variety of amplification reactions, including linear amplification reactions using a single primer, or polymerase chain reactions, employing two or more primers, for example. Guidance for selecting the lengths and sequences of primers may be found in Dieffenbach and Dveksler (eds.), PCR PRIMER: A LABORATORY MANUAL, 2nd ed. (Cold Spring Harbor Laboratory Press, New York, 2003).

As used herein, “polynucleotide” or “oligonucleotide” generally refers to a linear polymer of nucleotide monomers and may be DNA or RNA. Monomers making up polynucleotides are capable of specifically binding to a natural polynucleotide by way of a regular pattern of monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, for example. Such monomers and their internucleosidic linkages may be naturally occurring or may be analogs thereof, e.g., naturally occurring or non-naturally occurring analogs. Non-naturally occurring analogs may include PNAs, phosphorothioate internucleosidic linkages, bases containing linking groups permitting the attachment of labels, such as fluorophores, or haptens, for example. In an embodiment, oligonucleotide may refer to smaller polynucleotides, for example, having 5-40 monomeric units. Polynucleotides may include the natural deoxyribonucleosides (e.g., deoxyadenosine, deoxycytidine, deoxyguanosine, and deoxythymidine for DNA or their ribose counterparts for RNA) linked by phosphodiester linkages. However, they may also include non-natural nucleotide analogs, e.g., including modified bases, sugars, or internucleosidic linkages. In an embodiment, a polynucleotide may be represented by a sequence of letters (upper or lower case), such as “ATGCCTG,” and it will be understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes deoxythymidine, and that “I” denotes deoxyinosine, and “U” denotes deoxyuridine, unless otherwise indicated or obvious from context. Whenever the use of an oligonucleotide or polynucleotide requires enzymatic processing, such as extension by a polymerase or ligation by a ligase, oligonucleotides or polynucleotides in those instances may not contain certain analogs of internucleosidic linkages, sugar moieties, or bases at any or some positions. Unless otherwise noted, the terminology and atom numbering conventions will follow those disclosed in Strachan and Read, HUMAN MOLECULAR GENETICS, 2nd ed. (Wiley-Liss, New York, 1999). Polynucleotides may range in size from a few monomeric units, e.g., 5-40, to several thousand monomeric units, for example.

As used herein, “defined space” (or “reaction space,” which may be used interchangeably with “defined space”) generally refers to any space or region (which may be in one, two, or three dimensions) in which at least some of a molecule, fluid, and/or solid can be confined, retained and/or localized. The space may be a predetermined area (which may be a flat area) or volume, and may be defined, for example, by a depression or a micro-machined well in or associated with a microwell plate, microtiter plate, microplate, or a chip. The area or volume may also be determined based on an amount of fluid or solid, for example, deposited on an area or in a volume otherwise defining a space. For example, isolated hydrophobic areas on a generally hydrophobic surface may provide defined spaces. In an embodiment, a defined space may be a reaction chamber, such as a well or a microwell, which may be in a chip. In an embodiment, a defined space may be a substantially flat area on a substrate without wells, for example. A defined space may contain or be exposed to enzymes and reagents used in nucleotide incorporation.

As used herein, “reaction confinement region” generally refers to any region in which a reaction may be confined and includes, for example, a “reaction chamber,” a “well,” and a “microwell” (each of which may be used interchangeably). A reaction confinement region may include a region in which a physical or chemical attribute of a solid substrate can permit the localization of a reaction of interest, and a discrete region of a surface of a substrate that can specifically bind an analyte of interest (such as a discrete region with oligonucleotides or antibodies covalently linked to such surface), for example. Reaction confinement regions may be hollow or have well-defined shapes and volumes, which may be manufactured into a substrate. These latter types of reaction confinement regions are referred to herein as microwells or reaction chambers, may be fabricated using any suitable microfabrication techniques, and may have volume, shape, aspect ratio (e.g., base width-to-well depth ratio), and other dimensional characteristics that may be selected depending on particular applications, including the nature of reactions taking place as well as the reagents, by-products, and labeling techniques (if any) that are employed. Reaction confinement regions may also be substantially flat areas on a substrate without wells, for example. In various embodiments, microwells may be fabricated using any suitable fabrication technique known in the art. Exemplary configurations (e.g., spacing, shape, and volume) of microwells or reaction chambers are disclosed in Rothberg et al., U.S. Pat. Publ. Nos. 2009/0127589 and 2009/0026082; Rothberg et al., U.K. Pat. Appl. Publ. No. GB 2461127; and Kim et al., U.S. Pat. No. 7,785,862, which are all incorporated by reference in their entirety.

Defined spaces or reaction confinement regions may be arranged as an array, which may be a substantially planar one-dimensional or two-dimensional arrangement of elements such as sensors or wells. The number of columns and the number of rows of a two-dimensional array may or may not be the same. Preferably, the array comprises at least 100,000 chambers. Preferably, each reaction chamber has a horizontal width and a vertical depth that has an aspect ratio of about 1:1 or less. Preferably, the pitch between the reaction chambers is no more than about 10 microns. Preferably, each reaction chamber is no greater than 10³ μm³ (i.e., 1 pL) in volume, or no greater than 0.34 pL in volume, and more preferably no greater than 0.096 pL or even 0.012 pL in volume. A reaction chamber may be 2², 3², 4², 5², 6², 7², 8², 9², or 10² square microns in cross-sectional area at the top, for example. Preferably, the array may have at least 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or more reaction chambers, for example. The reaction chambers may be capacitively coupled to chemFETs.

Defined spaces or reaction confinement regions, whether arranged as an array or in some other configuration, may be in electrical communication with at least one sensor to allow detection or measurement of one or more detectable or measurable parameters or characteristics. The sensors may convert changes in the presence, concentration, or amounts of reaction by-products (or changes in ionic character of reactants) into an output signal, which may be registered electronically, for example, as a change in a voltage level or a current level which, in turn, may be processed to extract information about a chemical reaction or desired association event, for example, a nucleotide incorporation event. The sensors may include at least one chemically sensitive field effect transistor (“chemFET”) that can be configured to generate at least one output signal related to a property of a chemical reaction or target analyte of interest in proximity thereof. Such properties can include a concentration (or a change in concentration) of a reactant, product or by-product, or a value of a physical property (or a change in such value), such as an ion concentration. An initial measurement or interrogation of a pH for a defined space or reaction confinement region, for example, may be represented as an electrical signal or a voltage, which may be digitized (e.g., converted to a digital representation of the electrical signal or the voltage). Any of these measurements and representations may be considered raw data or a raw signal.

As used herein, “nucleic acid template” (or “sequencing template,” which may be used interchangeably with “nucleic acid template”) generally refers to a nucleic acid sequence that is a target of one or more nucleic acid sequencing reactions. A sequence for a nucleic acid template may comprise a naturally-occurring or synthetic nucleic acid sequence. A sequence for a nucleic acid template may also include a known or unknown nucleic acid sequence from a sample of interest. In various embodiments, a nucleic acid template may be attached to a solid support such as, e.g., a bead, microparticle, flow cell, or any other surface, support, or object.

As used herein, “fragment library” generally refers to a collection of nucleic acid fragments in which one or more fragments are used as a sequencing template. A fragment library may be generated in numerous ways (e.g., by cutting, shearing, restricting, or otherwise subdividing a larger nucleic acid into smaller fragments). Fragment libraries may be generated or obtained from naturally occurring nucleic acids, such as, for example, from bacteria, cancer cells, normal cells, or solid tissue. Libraries comprising synthetic nucleic acid sequences may also be generated to create a synthetic fragment library.

As used herein, “mate pair library” generally refers to a collection of nucleic acid sequences comprising two or more fragments having a particular relationship, such as being separated by a known number of nucleotides, for example. Mate pair fragments may be generated in numerous ways (e.g., by cutting, shearing, restricting, or otherwise subdividing a larger nucleic acid and associating the sequence fragments from the ends of the resulting fragments or by associating other subsequences of the resulting fragments). Mate pair libraries may be generated, for example, by circularizing a nucleic acid with an internal adapter construct and then removing the middle portion of the nucleic acid to create a linear strand of nucleic acid comprising the internal adapter with the sequences from the ends of the nucleic acid attached to either end of the internal adapter. Like fragment libraries, mate pair libraries may be generated or obtained from naturally occurring nucleic acid sequences, such as, for example, from bacteria, cancer cells, normal cells, or solid tissue. Synthetic mate pair libraries may also be generated, e.g., by attaching synthetic nucleic acid sequences to either end of an internal adapter sequence.

As used herein, a “molecular sample discriminating code” (or “molecular barcode,” which may be used interchangeably with “molecular sample discriminating code”) generally refers to an identifiable or resolvable molecular marker, which may be uniquely resolved and may be attached to a sample nucleic acid, biomolecule, or polymer, for example. Such a molecular sample discriminating code may be used for tracking, sorting, separating, and/or identifying sample nucleic acids, biomolecules, or polymers, and may be designed to have properties useful for manipulating nucleic acids, biomolecules, polymers, or other molecules. Molecular sample discriminating codes may comprise the same kind or type of material or subunits comprising the nucleic acid, biomolecule, or polymer they are intended to identify, or they may comprise one or more different material(s) or subunit(s). A molecular sample discriminating code may comprise a short nucleic acid comprising a known or predetermined sequence. A molecular sample discriminating code may be a nucleic acid sample discriminating code (or nucleic acid barcode), which may be an identifiable or resolvable nucleotide sequence (e.g., an oligonucleotide or polynucleotide sequence), and which may include one or more restriction endonuclease recognition sequences or cleavage sites, overhang ends, adaptor sequences, primer sequences, and the like (including combinations of features or properties). A molecular sample discriminating code may be a biopolymer sample discriminating code, which may include one or more antibody recognition sites, restriction sites, intra- or inter-molecule binding sites, and the like (including combinations of features or properties). A plurality of different molecular sample discriminating codes may be used to identify or characterize samples belonging to a common group, and may be attached to, coupled with, or otherwise associated with libraries of nucleic acids, biomolecules, polymers, or other molecules, for example. A molecular sample discriminating code or molecular barcode may be represented by a sample discriminating code or sequence or barcode, which may comprise a set of symbols, components, or characters that may be used to represent or define a molecular sample discriminating code or barcode. For example, a sample discriminating code or barcode may comprise a sequence of letters defining a known or predetermined sequence of nucleic acid bases or other biomolecule or polymer constituents. Sample discriminating codes or barcodes may be used in a variety of sets, subsets, and groupings. Sample discriminating codes or barcodes may be read, or otherwise recognized, identified, or interpreted as a function of a sequence or other arrangement or relationship of subunits that together form a code. Sample discriminating codes or barcodes may also contain one or more additional functional elements including key sequences for quality control and sample detection, primer sites, adaptors for ligation, linkers for attaching to substrates, inserts, etc., for example.

FIG. 1A illustrates an exemplary system for obtaining, processing, and/or analyzing multiplex nucleic acid sequencing data. The system includes a sequencing instrument 601, a server 602 or other computing means or resource, and one or more end user computers 605 or other computing means or resource. The sequencing instrument 601 may be configured to process samples comprising barcodes and to deliver reagents according to a predetermined ordering. The predetermined ordering may be based on a cyclical, repeating pattern of consecutive repeats of a predetermined reagent flow ordering (e.g., consecutive repeats of a predetermined sequence of four nucleotide reagents such as “TACG . . . ”), or may be based on a random reagent flow ordering, or may be based on an ordering comprising in whole or in part a phase-protecting reagent flow ordering as described in Hubbell et al., U.S. patent application Ser. No. 13/440,849, filed Apr. 5, 2012, which is incorporated by reference herein in its entirety, or some combination thereof. The server 602 may include a processor 603 and a memory and/or database 604. The sequencing instrument 601 and the server 602 may include one or more computer readable media for obtaining, processing, and/or analyzing multiplex nucleic acid sequencing data. In an embodiment, the barcodes may be determined at least in part as a function of the ordering. In an embodiment, the barcodes may comprise codewords of an error-tolerant code, which codewords may be represented in flowspace rather than base space (e.g., may comprise digits or characters corresponding to numbers of nucleotide incorporations responsive to predetermined nucleotide flows rather than actual bases such as nucleic acid bases in base space). In an embodiment, the instrument and the server or other computing means or resource may be configured as a single component. One or more of these components may be used to perform or implement one or more aspects of the embodiments described herein.
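The distinction between base space and flowspace may be easier to see in a short sketch. The Python below builds a cyclical flow ordering from repeats of “TACG” and converts a base space string into a vector of per-flow incorporation counts. The function names are hypothetical, and the sketch assumes the input string lists the bases in the order they are incorporated into the growing strand; it is an illustration only, not the disclosed implementation.

```python
from itertools import cycle, islice


def flow_ordering(pattern="TACG", n_flows=16):
    """Build a cyclical flow ordering from consecutive repeats of `pattern`."""
    return "".join(islice(cycle(pattern), n_flows))


def to_flowspace(bases, flows):
    """Convert a base space string into per-flow incorporation counts.

    Assumes `bases` lists nucleotides in the order they are incorporated.
    Each flow consumes the whole run of matching bases (a homopolymer),
    or contributes 0 if the next base does not match the flowed nucleotide.
    """
    vector = []
    pos = 0
    for nuc in flows:
        count = 0
        while pos < len(bases) and bases[pos] == nuc:
            count += 1
            pos += 1
        vector.append(count)
    if pos < len(bases):
        raise ValueError("flow ordering is too short to cover the sequence")
    return vector


if __name__ == "__main__":
    flows = flow_ordering("TACG", 16)
    print(flows)
    print(to_flowspace("ATGCCTG", flows))
```

Under these assumptions, the same base sequence maps to different flowspace vectors under different flow orderings, which is why a flowspace barcode is tied to the ordering it was designed for.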

FIG. 1B illustrates an exemplary method for obtaining, processing, and/or analyzing multiplex nucleic acid sequencing data. In step 611, a user obtains multiplex sequencing data from an instrument configured to analyze sample nucleic acids comprising a nucleic acid barcode designed at least in part as a function of a predetermined or expected reagent flow ordering. The multiplex sequencing data may include signal data indicative of hydrogen ion concentrations, for example. The predetermined ordering may be based on a cyclical, repeating pattern of consecutive repeats of a predetermined reagent flow ordering (e.g., consecutive repeats of a predetermined sequence of four nucleotide reagents such as “TACG . . . ”), or may be based on a random reagent flow ordering, or may be based on an ordering comprising in whole or in part a phase-protecting reagent flow ordering as described in Hubbell et al., U.S. patent application Ser. No. 13/440,849, filed Apr. 5, 2012, which is incorporated by reference herein in its entirety, or some combination thereof. In step 612, a server or other computing means or resource converts the multiplex sequencing data into sequences of bases. In step 613, the server or other computing means or resource delivers the multiplex sequencing data and/or sequences of bases to an end user. One or more of these steps and/or components may be used to perform or implement one or more aspects of the embodiments described herein.

In various embodiments, the methods, systems, computer readable media, and kits described herein may advantageously be used to determine the sequence and/or identity of one or more nucleic acid samples using sequencing-by-synthesis. In sequencing-by-synthesis, the sequence of a target nucleic acid may be determined by the stepwise synthesis of complementary nucleic acid strands on a target nucleic acid (whose sequence and/or identity is to be determined) serving as a template for the synthesis reactions (e.g., by a polymerase extension reaction that typically includes the formation of a complex comprising a template (or target polynucleotide), a primer annealed thereto, and a polymerase operably coupled or associated with the primer-template hybrid so as to be capable of incorporating a nucleotide species (e.g., a nucleoside triphosphate, a nucleotide triphosphate, a precursor nucleoside or nucleotide) to the primer). During sequencing-by-synthesis, nucleotides may be sequentially added to growing polynucleotide molecules or strands at positions complementary to template polynucleotide molecules or strands. The addition of the nucleotides to the growing complementary strands, which may be detected using a variety of methods (e.g., pyrosequencing, fluorescence detection, and label-free electronic detection), may be used to identify the sequence composition of the template nucleic acid. This process may be iterated until a complete or selected sequence length complementary to the template has been synthesized.

In various embodiments, the methods, systems, computer readable media, and kits described herein may advantageously be used to generate, process, and/or analyze data and signals obtained using electronic or charge-based nucleic acid sequencing. In electronic or charge-based sequencing (such as, e.g., pH-based sequencing), a nucleotide incorporation event may be determined by detecting ions (e.g., hydrogen ions) generated as natural by-products of polymerase-catalyzed nucleotide extension reactions. This may be used to sequence a sample or template nucleic acid, which may be a fragment of a nucleic acid sequence of interest, for example, and which may be directly or indirectly attached as a clonal population to a solid support, such as a particle, microparticle, bead, etc. The sample or template nucleic acid may be operably associated to a primer and polymerase and may be subjected to repeated cycles or “flows” of deoxynucleoside triphosphate (“dNTP”) addition (which may be referred to herein as “nucleotide flows” from which nucleotide incorporations may result) and washing. The primer may be annealed to the sample or template so that the primer's 3′ end can be extended by a polymerase whenever dNTPs complementary to the next base in the template are added. Based on the known sequence of flows and on measured signals indicative of ion concentration during each nucleotide flow, the identity of the type, sequence and number of nucleotide(s) associated with a sample nucleic acid present in a reaction chamber can be determined.
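The last step described above, going from a known flow sequence and per-flow signals to bases, can be sketched very simply in Python. The sketch assumes, purely for illustration, that each signal has already been normalized so that a value near 1.0 corresponds to a single incorporation; the function name and the example values are hypothetical, and the rounding rule stands in for the far more careful signal modeling a real base caller would need.

```python
def call_bases(signals, flows):
    """Turn normalized per-flow signals into a string of incorporated bases.

    Assumption: signals are scaled so that ~1.0 means one incorporation,
    ~2.0 means a two-base homopolymer, and ~0.0 means no incorporation.
    Each signal is rounded to the nearest non-negative integer count.
    """
    if len(signals) != len(flows):
        raise ValueError("one signal is required per nucleotide flow")
    called = []
    for nuc, signal in zip(flows, signals):
        count = max(0, round(signal))
        called.append(nuc * count)
    return "".join(called)


if __name__ == "__main__":
    flows = "TACGTACG"
    # Hypothetical normalized signals, one per flow.
    signals = [0.05, 1.10, 0.00, 0.20, 0.90, 0.10, 0.03, 1.02]
    print(call_bases(signals, flows))
```

The result lists the bases incorporated into the growing strand, which are complementary to the template rather than identical to it.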

FIG. 2A illustrates components of an exemplary system for nucleic acid sequencing. The components include a flow cell and sensor array 100, a reference electrode 108, a plurality of reagents 114, a valve block 116, a wash solution 110, a valve 112, a fluidics controller 118, lines 120/122/126, passages 104/109/111, a waste container 106, an array controller 124, and a user interface 128. The flow cell and sensor array 100 includes an inlet 102, an outlet 103, a microwell array 107, and a flow chamber 105 defining a flow path of reagents over the microwell array 107. The reference electrode 108 may be of any suitable type or shape, including a concentric cylinder with a fluid passage or a wire inserted into a lumen of passage 111. The reagents 114 may be driven through the fluid pathways, valves, and flow cell by pumps, gas pressure, or other suitable methods, and may be discarded into the waste container 106 after exiting the flow cell and sensor array 100. The reagents 114 may, for example, contain dNTPs to be flowed through passages 130 and through the valve block 116, which may control the flow of the reagents 114 to flow chamber 105 (also referred to herein as a reaction chamber) via passage 109. The system may include a reservoir 110 for containing a wash solution that may be used to wash away dNTPs, for example, that may have previously been flowed. The microwell array 107 may include an array of defined spaces or reaction confinement regions, such as microwells, for example, that is operationally associated with a sensor array so that, for example, each microwell has a sensor suitable for detecting an analyte or reaction property of interest. The microwell array 107 may preferably be integrated with the sensor array as a single device or chip. The flow cell may have a variety of designs for controlling the path and flow rate of reagents over the microwell array 107, and may be a microfluidics device. The array controller 124 may provide bias voltages and timing and control signals to the sensor, and collect and/or process output signals. The user interface 128 may display information from the flow cell and sensor array 100 as well as instrument settings and controls, and allow a user to enter or set instrument settings and controls. The system may be configured to let a single fluid or reagent contact the reference electrode 108 throughout an entire multi-step reaction. The valve 112 may be shut to prevent any wash solution 110 from flowing into passage 109 as the reagents are flowing. Although the flow of wash solution may be stopped, there may still be uninterrupted fluid and electrical communication between the reference electrode 108, passage 109, and the sensor array 107. The distance between the reference electrode 108 and the junction between passages 109 and 111 may be selected so that little or no amount of the reagents flowing in passage 109 and possibly diffusing into passage 111 reach the reference electrode 108. In an embodiment, the wash solution 110 may be selected as being in continuous contact with the reference electrode 108, which may be especially useful for multi-step reactions using frequent wash steps. In various embodiments, the fluidics controller 118 may be programmed to control driving forces for flowing reagents 114 and the operation of valve 112 and valve block 116 with any suitable instrument control software, such as LabView (National Instruments, Austin, Tex.), to deliver reagents to the flow cell and sensor array 100 according to a predetermined reagent flow ordering. The reagents may be delivered for predetermined durations and at predetermined flow rates, and physical and/or chemical parameters providing information about the status of one or more reactions taking place in defined spaces or reaction confinement regions, such as, for example, microwells, may be measured.

FIG. 2B illustrates an exemplary system 2101 for nucleic acid sequencing. The system includes a reactor array 2102; a reader board 2103; a computer and/or server 2104, which includes a CPU 2105 and a memory 2106; and a display 2107, which may be internal and/or external. One or more of these components may be used to perform or implement one or more aspects of the exemplary embodiments described herein.

FIG. 3A illustrates cross-sectional and expanded views of an exemplary flow cell 200 for nucleic acid sequencing. The flow cell 200 includes a microwell array 202, a sensor array 205, and a flow chamber 206 in which a reagent flow 208 may move across a surface of the microwell array 202, over open ends of microwells in the microwell array 202. The flow of reagents (e.g., nucleotide species) can be provided in any suitable manner, including delivery by pipettes, or through tubes or passages connected to a flow chamber. The duration, concentration, and/or other flow parameters may be the same or different for each reagent flow. Likewise, the duration, composition, and/or concentration for each wash flow may be the same or different. A microwell 201 in the microwell array 202 may have any suitable volume, shape, and aspect ratio, which may be selected depending on one or more of any reagents, by-products, and labeling techniques used, and the microwell 201 may be formed in layer 210, for example, using any suitable microfabrication technique. A sensor 214 in the sensor array 205 may be an ion sensitive (ISFET) or a chemical sensitive (chemFET) sensor with a floating gate 218 having a sensor plate 220 separated from the microwell interior by a passivation layer 216, and may be predominantly responsive to (and generate an output signal related to) an amount of charge 224 present on the passivation layer 216 opposite of the sensor plate 220. Changes in the amount of charge 224 cause changes in the current between a source 221 and a drain 222 of the sensor 214, which may be used directly to provide a current-based output signal or indirectly with additional circuitry to provide a voltage output signal. Reactants, wash solutions, and other reagents may move into microwells primarily by diffusion 240. One or more analytical reactions to identify or determine characteristics or properties of an analyte of interest may be carried out in one or more microwells of the microwell array 202. Such reactions may generate directly or indirectly by-products that affect the amount of charge 224 adjacent to the sensor plate 220. In an embodiment, a reference electrode 204 may be fluidly connected to the flow chamber 206 via a flow passage 203. In an embodiment, the microwell array 202 and the sensor array 205 may together form an integrated unit forming a bottom wall or floor of the flow cell 200. In an embodiment, one or more copies of an analyte may be attached to a solid phase support 212, which may include microparticles, nanoparticles, beads, gels, and may be solid and porous, for example. The analyte may include a nucleic acid analyte, including a single copy and multiple copies, and may be made, for example, by rolling circle amplification (RCA), exponential RCA, or other suitable techniques to produce an amplicon without the need of a solid support.

FIG. 3B illustrates an exemplary uniform flow front between successive reagents moving across a section 234 of an exemplary microwell array. A “uniform flow front” between first reagent 232 and second reagent 230 generally refers to the reagents undergoing little or no mixing as they move, thereby keeping a boundary 236 between them narrow. The boundary may be linear for flow cells having inlets and outlets at opposite ends of their flow chambers, or it may be curvilinear for flow cells having central inlets (or outlets) and peripheral outlets (or inlets). In an embodiment, the flow cell design and reagent flow rate may be selected so that each new reagent flows with a uniform flow front as it transits the flow chamber during a switch from one reagent to another.

FIG. 4 illustrates an exemplary process for label-free, pH-based sequencing. A template 682 with sequence 685 and a primer binding site 681 are attached to a solid phase support 680. The template 682 may be attached as a clonal population to a solid support, such as a microparticle or bead, for example, and may be prepared as disclosed in Leamon et al., U.S. Pat. No. 7,323,305, which is incorporated by reference herein in its entirety. In an embodiment, the template may be associated with a substrate surface or present in a liquid phase with or without being coupled to a support. A primer 684 and DNA polymerase 686 are operably bound to the template 682. As used herein, “operably bound” generally refers to a primer being annealed to a template so that the primer's 3′ end may be extended by a polymerase and that a polymerase is bound to such primer-template duplex (or in close proximity thereof) so that binding and/or extension may take place when dNTPs are added. In step 688, dNTP (shown as dATP) is added, and the DNA polymerase 686 incorporates a nucleotide “A” (since “T” is the next nucleotide in the template 682 and is complementary to the flowed dATP nucleotide). In step 690, a wash is performed. In step 692, the next dNTP (shown as dCTP) is added, and the DNA polymerase 686 incorporates a nucleotide “C” (since “G” is the next nucleotide in the template 682). The pH-based nucleic acid sequencing, in which base incorporations may be determined by measuring hydrogen ions that are generated as natural by-products of polymerase-catalyzed extension reactions, may be performed using at least in part one or more features of Anderson et al., A SYSTEM FOR MULTIPLEXED DIRECT ELECTRICAL DETECTION OF DNA SYNTHESIS, Sensors and Actuators B: Chem., 129:79-86 (2008); Rothberg et al., U.S. Pat. Appl. Publ. No. 2009/0026082; and Pourmand et al., DIRECT ELECTRICAL DETECTION OF DNA SYNTHESIS, Proc. Natl. Acad. Sci., 103:6466-6470 (2006), which are all incorporated by reference herein in their entirety. In an embodiment, after each addition of a dNTP, an additional step may be performed in which the reaction chambers are treated with a dNTP-destroying agent, such as apyrase, to eliminate any residual dNTPs remaining in the chamber that might result in spurious extensions in subsequent cycles.

In an embodiment, the primer-template-polymerase complex may be subjected to a series of exposures of different nucleotides in a predetermined or known sequence or ordering. If one or more nucleotides are incorporated, then the signal resulting from the incorporation reaction may be detected, and after repeated cycles of nucleotide addition, primer extension, and signal acquisition, the nucleotide sequence of the template strand may be determined. The output signals measured throughout this process depend on the number of nucleotide incorporations. Specifically, in each addition step, the polymerase extends the primer by incorporating added dNTP only if the next base in the template is complementary to the added dNTP. If there is one complementary base, there is one incorporation; if two, there are two incorporations; if three, there are three incorporations, and so on. With each incorporation, a hydrogen ion is released, and collectively a population of released hydrogen ions changes the local pH of the reaction chamber. The production of hydrogen ions may be monotonically related to the number of contiguous complementary bases in the template (as well as to the total number of template molecules with primer and polymerase that participate in an extension reaction). Thus, when there is a number of contiguous identical complementary bases in the template (which may represent a homopolymer region), the number of hydrogen ions generated and thus the magnitude of the local pH change is proportional to the number of contiguous identical complementary bases (and the corresponding output signals are then sometimes referred to as “1-mer,” “2-mer,” “3-mer” output signals, etc.). If the next base in the template is not complementary to the added dNTP, then no incorporation occurs and no hydrogen ion is released (and the output signal is then sometimes referred to as a “0-mer” output signal). In each wash step of the cycle, an unbuffered wash solution at a predetermined pH may be used to remove the dNTP of the previous step in order to prevent misincorporations in later cycles. Deliveries of nucleotides to a reaction vessel or chamber may be referred to as “flows” of nucleotide triphosphates (or dNTPs). For convenience, a flow of dATP will sometimes be referred to as “a flow of A” or “an A flow,” and a sequence of flows may be represented as a sequence of letters, such as “ATGT” indicating “a flow of dATP, followed by a flow of dTTP, followed by a flow of dGTP, followed by a flow of dTTP.” In an embodiment, the four different kinds of dNTP are added sequentially to the reaction chambers, so that each reaction is exposed to the four different dNTPs, one at a time. In an embodiment, the four different kinds of dNTP are added in the following sequence: dATP, dCTP, dGTP, dTTP, dATP, dCTP, dGTP, dTTP, etc., with each exposure, incorporation, and detection step followed by a wash step. Each exposure to a nucleotide followed by a washing step can be considered a “nucleotide flow.” Four consecutive nucleotide flows can be considered a “cycle.” For example, a two cycle nucleotide flow order can be represented by: dATP, dCTP, dGTP, dTTP, dATP, dCTP, dGTP, dTTP, with each exposure being followed by a wash step. Different flow orders are of course possible. In various embodiments, the predetermined sequence or ordering may be based on a cyclical, repeating pattern of consecutive repeats of a predetermined reagent flow ordering (e.g., consecutive repeats of a predetermined sequence of four nucleotide reagents such as “TACG . . . ”), or may be based on a random reagent flow ordering, or may be based on an ordering comprising in whole or in part a phase-protecting reagent flow ordering as described in Hubbell et al., U.S. patent application Ser. No. 13/440,849, filed Apr. 5, 2012, which is incorporated by reference herein in its entirety, or some combination thereof.

FIG. 5 shows an exemplary ionogram representation of signals from which base calls may be made. In this example, the x-axis shows the nucleotide that is flowed and the corresponding number of nucleotide incorporations may be estimated by rounding to the nearest integer shown in the y-axis, for example. Signals used to make base calls and determine a flowspace vector may be from any suitable point in the acquisition or processing of the data signals received from sequencing operations. For example, the signals may be raw acquisition data or data having been processed, such as, e.g., by background filtering, normalization, correction for signal decay, and/or correction for phase errors or effects, etc. The base calls may be made by analyzing any suitable signal characteristics (e.g., signal amplitude or intensity).
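
By way of non-limiting illustration, the rounding described above may be sketched as follows. The function name call_homopolymers and the signal values are hypothetical and chosen only for this example; the sketch assumes signals that have already been normalized so that a single incorporation yields a value near 1.

```python
import math

def call_homopolymers(signals):
    """Estimate incorporations per flow by rounding each normalized
    signal to the nearest integer, never going below zero."""
    return [max(0, math.floor(s + 0.5)) for s in signals]

# Hypothetical normalized signals for the flows T, A, C, G, T, A
# (illustrative values, not measured data).
flows = "TACGTA"
signals = [1.04, 0.08, 0.93, 0.11, -0.03, 2.12]
calls = call_homopolymers(signals)

for nuc, s, n in zip(flows, signals, calls):
    print(f"flow {nuc}: signal {s:+.2f} -> {n}-mer")
```

Rounding is only the simplest possible estimator; the processing steps named above (normalization, decay and phase correction) would ordinarily precede it.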

In various embodiments, output signals due to nucleotide incorporation may be processed in various ways to improve their quality and/or signal-to-noise ratio, which may include performing or implementing one or more of the teachings disclosed in Rearick et al., U.S. patent application Ser. No. 13/339,846, filed Dec. 29, 2011, and in Hubbell, U.S. patent application Ser. No. 13/339,753, filed Dec. 29, 2011, which are all incorporated by reference herein in their entirety.

In various embodiments, output signals due to nucleotide incorporation may be further processed, given knowledge of what nucleotide species were flowed and in what order to obtain such signals, to make base calls for the flows and compile consecutive base calls associated with a sample nucleic acid template into a read. A base call refers to a particular nucleotide identification (e.g., dATP (“A”), dCTP (“C”), dGTP (“G”), or dTTP (“T”)). Base calling may include performing one or more signal normalizations, signal phase and signal droop (e.g., enzyme efficiency loss) estimations, and signal corrections, and may identify or estimate base calls for each flow for each defined space. Base calling may include performing or implementing one or more of the teachings disclosed in Davey et al., U.S. patent application Ser. No. 13/283,320, filed Oct. 27, 2011, which is incorporated by reference herein in its entirety. Other aspects of signal processing and base calling may include performing or implementing one or more of the teachings disclosed in Davey et al., U.S. patent application Ser. No. 13/340,490, filed on Dec. 29, 2011, and Sikora et al., U.S. patent application Ser. No. 13/588,408, filed on Aug. 17, 2012, which are all incorporated by reference herein in their entirety.

FIGS. 6A and 6B demonstrate a relationship between a base space sequence and a flowspace vector. A series of signals representative of a number of incorporations or lack thereof (e.g., 0-mer, 1-mer, 2-mer, etc.) produced by a series of nucleotide flows may be referred to as a flowspace vector or sequence, as opposed to a base space sequence, which is simply the order of identified nucleotide bases in a nucleic acid of interest. The flowspace vector may be produced using any suitable nucleotide flow ordering, including a predetermined ordering based on a cyclical, repeating pattern of consecutive repeats of a predetermined reagent flow ordering, based on a random reagent flow ordering, or based on an ordering comprising in whole or in part a phase-protecting reagent flow ordering as described in Hubbell et al., U.S. patent application Ser. No. 13/440,849, filed Apr. 5, 2012, or some combination thereof. In FIGS. 6A and 6B, an exemplary base space sequence AGTCCA is subjected to sequencing operations using a cyclical flow ordering of TACG (that is, a T nucleotide flow, followed by an A nucleotide flow, followed by a C nucleotide flow, followed by a G nucleotide flow, and this 4-flow ordering is then repeated cyclically). The flows result in a series of signals having an amplitude (e.g., signal intensity) related to the number of nucleotide incorporations (e.g., 0-mer, 1-mer, 2-mer, etc.). This series of signals generates the flowspace vector 101001021. As shown in FIG. 6A, the base space sequence AGTCCA may be translated to a flowspace vector 101001021 under a cyclical flow ordering of TACG. The flowspace vector may change if the flow ordering is changed. As shown in FIG. 6B, the flowspace vector may be mapped back to the base space sequence associated with the sample.
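
By way of non-limiting illustration, the translation between base space and flowspace may be sketched as follows. The sketch assumes that the base space sequence is read as the template and that each flowed nucleotide incorporates opposite its Watson-Crick complement (as in the description of FIG. 4); that assumption is what reproduces the vector 101001021 for AGTCCA under TACG. The names COMPLEMENT, to_flowspace, and to_basespace are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def to_flowspace(template, flow_order):
    """Project a template sequence into flowspace under a cyclical flow ordering."""
    needed = {COMPLEMENT[base] for base in template}
    if not needed <= set(flow_order):
        raise ValueError("flow ordering lacks a nucleotide needed by the template")
    vector, pos, flow = [], 0, 0
    while pos < len(template):
        nuc = flow_order[flow % len(flow_order)]
        run = 0
        # The flowed nucleotide extends across every consecutive template
        # base it is complementary to (a homopolymer run).
        while pos < len(template) and COMPLEMENT[template[pos]] == nuc:
            run += 1
            pos += 1
        vector.append(run)
        flow += 1
    return vector

def to_basespace(vector, flow_order):
    """Map a flowspace vector back to the template sequence."""
    return "".join(
        COMPLEMENT[flow_order[flow % len(flow_order)]] * count
        for flow, count in enumerate(vector)
    )

vector = to_flowspace("AGTCCA", "TACG")
print("".join(map(str, vector)))       # 101001021
print(to_basespace(vector, "TACG"))    # AGTCCA
```

Calling to_flowspace with a different flow ordering on the same sequence generally yields a different vector, consistent with the statement above.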

Barcodes

In various embodiments, sample discriminating codes or barcodes may comprise or correspond to or with (whether directly or indirectly) sequences of nucleotides, biomolecule components and/or subunits, or polymer components and/or subunits. In an embodiment, a sample discriminating code or barcode may correspond to a sequence of individual nucleotides in a nucleic acid or subunits of a biomolecule or polymer or to sets, groups, or continuous or discontinuous sequences of such nucleotides or subunits. In an embodiment, a sample discriminating code or barcode may also correspond to or with (whether directly or indirectly) transitions between nucleotides, biomolecule subunits, or polymer subunits, or other relationships between subunits forming a sample discriminating code or barcode.

In various embodiments, sample discriminating codes or barcodes may have properties that permit them to be read, or otherwise recognized, identified, or interpreted with improved accuracy and/or reduced error rates for a given code type, length, or complexity. In an embodiment, a sample discriminating code or barcode may be designed as a set (which may include subsets) of individual sample discriminating codes or barcodes. In some embodiments, one or more sample discriminating codes or barcodes in a set (or in a subset from that set) may be selected based on one or more criteria to improve accuracy and/or reduce error rates in reading, or otherwise recognizing, identifying, discriminating, or interpreting the codes.

In various embodiments, sample discriminating codes or barcodes may be designed to exhibit high fidelity reads, which may be assessed based on empirical sequencing measurements. The level of fidelity may be based on predictions of the read accuracy of a sample discriminating code or barcode having a particular nucleotide sequence. Certain nucleotide sequences known to cause sequencing read ambiguity, errors, or sequencing bias may be avoided. Design may be based on accurately calling the sample discriminating code or barcode (and associated sample or nucleic acid population), even in the presence of one or more errors. In various embodiments, fidelity may be based on the probability of correctly sequencing the sample discriminating code or barcode, which may be at least 82%, or at least 85%, or at least 90%, or at least 95%, or at least 99%, or more.

In various embodiments, sample discriminating codes or barcodes may be designed to exhibit improved read accuracy for sequencing using a sequencing-by-synthesis platform (as discussed previously), which may include fluorophore-labeled nucleotide sequencing platforms or non-labeled sequencing platforms, such as the Ion PGM™ and Ion Proton™ Sequencers, for example. Design of the sample discriminating codes or barcodes and specific sequences are not limited to any particular instrument platform or sequencing technology, however. In the case of non-nucleic acid codes, sample discriminating codes or barcodes may be read, identified, interpreted or otherwise recognized using methods known in the art, including for example, amino acid sequencing for protein sample discriminating codes.

In various embodiments, sample discriminating codes or barcodes may be designed to optimize the barcode's observed performance in a sequencing process. A design approach may include applying a series of sample discriminating code or barcode constraints or criteria to achieve desired properties or performance. Such constraints or criteria may include one or more of uniqueness of nucleic acid barcode sequences and a degree of separation from other nucleic acid barcode sequences. A set of barcodes may be a nested set of barcodes, which may be based on one or more design criteria. In an embodiment, nested barcode sets may be designed analogously to Matryoshka nesting, such that the properties of a subset are entirely contained within the properties of a genus set. For example, a first subset of barcodes meeting certain properties (e.g., a high sequencing fidelity) may be selected from a larger set of barcodes meeting the same properties. For example, if a set of barcodes comprises 96 uniquely identifiable barcodes, then a subset of 16 barcodes may be selected from the 96 available barcodes for a sequencing experiment comprising only 16 multiplexed samples. The subset of 16 barcodes may thus be optimized to a similar degree as a larger subset of 32 barcodes or 48 barcodes selected from the full set of 96 barcodes. In an embodiment, the barcodes may be designed as an ordered list of nested barcodes. In an embodiment, the barcodes (e.g., a set of 96) may be ordered by having a first barcode, a second barcode that is the one (of the remaining 95) that is furthest from the first one under a suitable distance measure, a third barcode that is the one (of the remaining 94) that is furthest from the first and second one under a suitable distance measure, and so on until all the barcodes have been ordered. A user may then select an appropriate number of barcodes from such an ordered list.
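
By way of non-limiting illustration, the ordering of a barcode list described above may be sketched as a greedy selection. The sketch makes two assumptions not stated in the text: “furthest from” the already-ordered barcodes is taken to mean the largest minimum distance to any of them, and Hamming distance over equal-length strings stands in for the “suitable distance measure.” The names hamming and order_barcodes and the candidate strings are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def order_barcodes(barcodes, distance=hamming):
    """Order barcodes so that each next entry is the remaining barcode whose
    smallest distance to the already-ordered barcodes is largest."""
    remaining = list(barcodes)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]  # the first barcode is taken as given
    while remaining:
        best = max(remaining, key=lambda c: min(distance(c, s) for s in ordered))
        remaining.remove(best)
        ordered.append(best)
    return ordered

# Made-up candidates for illustration only.
candidates = ["AAAA", "AAAT", "TTTT", "AATT", "CCCC"]
ordered = order_barcodes(candidates)
print(ordered)        # ['AAAA', 'TTTT', 'CCCC', 'AATT', 'AAAT']
print(ordered[:3])    # a nested subset for a 3-sample experiment
```

Under this sketch, taking the first k entries of the ordered list gives a nested subset, so that a smaller experiment uses a prefix of the list used by a larger one.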

In various embodiments, sample discriminating codes or barcodes may contain a target sequence that is a sequence of interest, and in such cases may assist in uniquely identifying or discriminating different target sequences. The target sequence may be any type of sequence from any source of interest, including amplicons, candidate genes, mutational hot spots, single nucleotide polymorphisms, genomic library fragments, etc., for example. The sample discriminating code or barcode sequence may be operatively coupled to the target sequence at any of various points in the sample preparation process using techniques such as PCR amplification, DNA ligation, bacterial cloning, etc., for example. The sample discriminating code or barcode sequence may be contained in oligonucleotides and ligated to genomic library fragments using any suitable DNA ligation technique.

In various embodiments, sample discriminating codes or barcodes may have various lengths, which may be selected based on a number of samples to be identified. In various embodiments, for a 16-sample multiplexed sequencing experiment, 16 uniquely identifiable barcodes may be sufficient to uniquely identify each sample. Similarly, for a 64- or 96-sample multiplexed sequencing experiment, for example, only 64 or 96 barcodes may be sufficient, respectively. Larger numbers of samples may require longer codes. However, although longer codes will allow identification of a larger number of samples, they may have drawbacks. For example, in sequencing-by-synthesis, longer codes may require additional numbers of nucleotide flows, which may decrease accuracy given that sequencing tends to be maximally accurate in earlier flows, where the codes may typically be located. In various embodiments, nucleic acid barcodes, for example, may have a length of about 4-30 bases, or about 4-50 bases, or more, for example. The length may be selected based on one or both of a length of the probe sequence and a number of samples in the experiment. The length may be a multiple of the probe sequence length. The length may be longer for a larger number of samples. For example, in a 16-sample experiment using 5-base probe sequences, the barcodes may be 5 bases in length; and in a 96-sample experiment using 5-base probe sequences, the barcodes may be 10 bases in length.

In various embodiments, sample discriminating codes or barcodes may be designed based on one or more of the considerations and/or criteria set forth above (which may be taken alone or in combination). For example, a set of nucleic acid barcodes may be designed such that problematic sequences are avoided in all positions. In another example, a set of nucleic acid barcodes may be designed such that problematic sequences are avoided and the nucleic acid barcodes are sequenced with high fidelity. Various combinations of considerations/criteria may be chosen based on the sequencing experiment. For example, if barcodes are to be used for a small number of samples, the barcodes may not necessarily be designed to have nested subsets. The design considerations/criteria may be selected based on the number of samples, the level of accuracy needed, the sensitivity of the sequencing instrument to detect individual samples, the accuracy of the sequencing instrument, etc., for example.

In various embodiments, sample discriminating codes or barcodes may be designed to have error-tolerant properties. The error-tolerant properties may be in base space (in other words, the errors here may relate to incorrect bases in the barcode, such as, e.g., a “C” base present where a “T” base should be in the barcode). For example, a 1-base error-tolerant barcode set may be designed such that if a sequencing error is encountered at any position in one or more barcodes of the set each barcode may still be resolvable from other barcodes in the set (e.g., the set may be such that if one base is incorrect in a given erroneous barcode, that erroneous barcode may still be resolved from the other barcodes in the set because they all differ from the erroneous barcode in at least two base locations, for example, so that if the error occurred at one of these two locations, the other remains available to allow distinction between the barcodes). The set may also be designed to be able to distinguish barcodes within it even where multiple errors (e.g., 2, 3, etc.) in one or more of the barcodes are encountered. Such error-tolerant properties are advantageous as they help provide more certainty for resolving even complex multiplexed samples, even in the presence of potential sequencing errors. Candidate barcodes in a set may be compared to ascertain error-tolerant properties. For example, such barcodes can be compared (e.g., via computer analysis or simulation) to determine whether, if any one error (or 2, 3, etc., as the criterion may be) occurred, the codes can still be distinguished.
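
By way of non-limiting illustration, the computer comparison of candidate barcodes mentioned above may be sketched as a pairwise distance check. The sketch assumes equal-length barcodes and substitution errors only (insertions and deletions are not modeled), and it uses the standard coding-theory criterion that up to t errors leave an erroneous barcode closer to its original than to any other barcode when every pair differs in at least 2t + 1 positions. The function names and the example barcodes are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_distance(barcodes):
    """Smallest pairwise Hamming distance within a barcode set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

def tolerates_errors(barcodes, n_errors):
    """True if any barcode with up to n_errors substituted bases remains
    closer to its original than to every other barcode in the set."""
    return min_distance(barcodes) >= 2 * n_errors + 1

# Made-up barcodes for illustration only.
barcode_set = ["AAAAA", "CCCAA", "AACCC"]
print(min_distance(barcode_set))          # 3
print(tolerates_errors(barcode_set, 1))   # True
print(tolerates_errors(barcode_set, 2))   # False
```

A pairwise distance of 3 matches the example in the text: after one substitution, the erroneous barcode still differs from every other barcode in at least two base locations.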

In various embodiments, sample discriminating codes or barcodes as set forth herein may be used in any suitable manner to assist in identifying or resolving samples. For example, barcodes may be used individually, or two or more barcodes may be used in combination. In an embodiment, a single barcode may identify one target sequence or multiple target sequences. For example, a single barcode may identify a group of target sequences. A barcode may be read separately from the target sequence or as part of a larger read operation spanning the barcode and a target sequence. The barcode may be positioned at any suitable position within the sample, including before or after a target sequence.

Barcode Design & Flowspace

In various embodiments, sample discriminating codes or barcodes may be designed at least partly based on flowspace considerations, rather than base space. In other words, design may be based at least partly on flowspace vectors rather than base sequences. For example, sample discriminating codes or barcodes may be designed based on projection into flowspace as a flowspace vector under a selected or predetermined flow ordering. A string of characters in the flowspace vector of a barcode (e.g., a string of digits or characters such as 0, 1, 2, etc., that may represent a non-incorporation, a 1-mer incorporation, a 2-mer incorporation, etc., responsive to flows of nucleotide flowed or introduced according to a predetermined ordering) may represent or correspond to a codeword of an error-tolerant code (e.g., an error-correcting code). In an error-correcting code, a string of characters may be such that errors introduced into the string (e.g., during sequencing) can be detected and/or corrected based on remaining characters in the string. An error-correcting code may be made up of a set of different character strings, which may be referred to as codewords, over a given finite alphabet E of character elements. A codeword may be viewed as comprising a message plus some redundant data or parity data allowing a decoder to correctly decode a codeword containing one or more errors. Codewords may be designed to be sufficiently separated from one another to allow for a permissible number of errors to be detected in the transmission of a codeword and, in some cases, to be corrected by calculating which actual codeword is closest to the received codeword. In sequencing-by-synthesis, for example, a message string may be considered “transmitted” when a barcode has been sequenced and projected into flowspace as a flowspace string.
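
By way of non-limiting illustration, correcting a received flowspace string by finding the closest codeword may be sketched as follows. The sketch assumes Hamming distance as the measure of closeness and declines to decode when two codewords are equally close. The function names are hypothetical, and apart from 101001021 (the flowspace vector of FIG. 6A) the codewords are made-up strings rather than barcodes from any actual set.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def decode(received, codewords):
    """Return the unique codeword nearest to the received string,
    or None when the nearest codeword is not unique."""
    ranked = sorted(codewords, key=lambda c: hamming(received, c))
    if len(ranked) > 1 and hamming(received, ranked[0]) == hamming(received, ranked[1]):
        return None
    return ranked[0]

codewords = ["101001021", "010110102", "110200110"]

# A 2-mer read as a 1-mer in the eighth flow: one flowspace error.
received = "101001011"
print(decode(received, codewords))   # 101001021
print(decode("01", ["00", "11"]))    # None (tie, cannot be resolved)
```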

In various embodiments, sample discriminating codes or barcodes may be designed using any suitable type of error-correcting code. The error-correcting code may be a linear block code using an alphabet E of character elements with each codeword having n encoding character elements. Redundancy and/or parity data may be added to the message to allow a receiver to detect and/or correct errors in a transmitted codeword, and to recover the original message using a suitable decoding algorithm.

In various embodiments, sample discriminating codes or barcodes may be designed using various numbers of character elements in a code alphabet, which may vary according to a particular application. The error-correcting code may be a binary code using an alphabet of two character elements. The error-correcting code may be a ternary code using an alphabet of three character elements. In an embodiment, the number of character elements may depend on a length of a longest homopolymer run allowed in the barcode sequence. For example, if a barcode has only 1-mers (no repeating bases), then the error-correcting code may be a binary code with one character to represent a non-incorporation and another character to represent a single-base incorporation (e.g., an alphabet E for such a binary code may be {0, 1}, with “0” representing a non-incorporation and “1” representing a single-base incorporation). In another example, if the barcode has only 1-mers and 2-mers, then the error-correcting code may be a ternary code with the same non-incorporation and single-base incorporation characters, and a third character to represent a two-base incorporation (e.g., an alphabet E for such a ternary code may be {0, 1, 2}, with “2” representing a two-base incorporation). The size and the set of characters used in other code alphabets may be suitably modified if the barcode sequence has 3-mers, 4-mers, and so on.
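
By way of non-limiting illustration, the dependence of the code alphabet on the longest homopolymer run may be sketched as follows. The function names longest_homopolymer and code_alphabet are hypothetical, and the sketch assumes the alphabet is simply the integers from 0 up to the longest run present in the barcode.

```python
from itertools import groupby

def longest_homopolymer(barcode):
    """Length of the longest run of identical consecutive bases."""
    return max((len(list(run)) for _, run in groupby(barcode)), default=0)

def code_alphabet(barcode):
    """Characters 0..k, where k is the longest homopolymer run in the barcode."""
    return list(range(longest_homopolymer(barcode) + 1))

print(code_alphabet("AGTCA"))    # [0, 1]     binary: only 1-mers
print(code_alphabet("AGTCCA"))   # [0, 1, 2]  ternary: 1-mers and 2-mers
```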

In various embodiments, sample discriminating codes or barcodes may be designed using an error-correcting code based at least in part on Hamming codes, Golay codes, and/or tetracode codes. In an embodiment, the error-correcting code may be a binary Hamming code. In another embodiment, the error-correcting code may be a binary Golay code. In an embodiment, the error-correcting code may be a ternary Hamming code. In another embodiment, the error-correcting code may be a ternary Golay code. See, e.g., Hoffman et al., Coding Theory: The Essentials, Marcel Dekker, Inc. (1991); and Lin et al., Error Control Coding: Fundamentals And Applications, Prentice Hall, Inc. (1983).

In various embodiments, sample discriminating codes or barcodes may be designed to have error-tolerant properties expressed in flowspace rather than in base space (in other words, the errors here may relate to incorrect digits or characters in the flowspace representation of the barcode, e.g., an erroneous “0” or “0-mer” present where a “1” or “1-mer” should be in the flowspace representation). For example, a 1-base (in flowspace) error-tolerant barcode set may be designed such that, if a sequencing error is encountered at any position in the flowspace representation of one or more barcodes of the set, each barcode may still be resolvable from the other barcodes in the set because their flowspace representations all differ from the erroneous barcode(s) in at least two digit locations. If the error occurred at one of these two digit locations, the other remains available to allow distinction between the barcodes. The set may also be designed to be able to distinguish barcodes within it even where multiple errors in flowspace (e.g., 2, 3, etc.) in one or more of the barcodes are encountered. Such error-tolerant properties are advantageous as they help provide more certainty for resolving even complex multiplexed samples in the presence of potential sequencing errors. Candidate barcodes in a set may be compared to ascertain error-tolerant properties in flowspace. For example, such barcodes can be compared (e.g., via computer analysis or simulation) to determine whether, if any one error in flowspace (or 2, 3, etc., as the criterion may be) occurred, the codes can still be distinguished.

In various embodiments, sample discriminating codes or barcodes may be designed using one or more distance measures capable of evaluating a distance between codewords. In an embodiment, the distance measure may be a Hamming distance, which corresponds to the number of positions at which two codewords differ. Mathematically, if each codeword in a set of codewords has a Hamming distance of at least d from all other codewords in the set, then the code can correct up to (d−1)/2 errors (rounded down to an integer), or conversely, the Hamming distance d required to decode up to x errors is 2x+1. The quantity d may be referred to as the minimum distance of the code. The notation [n, k, d] may be used to characterize an error-correcting code of length n digits that encodes k information digits and has a minimum distance d. Other distance measures may be used, including a Euclidean distance measure, a sum of absolute values of differences between corresponding entries of two codewords, and a sum of squared differences between corresponding entries of two codewords, for example. In various embodiments, using such distance measures can allow a distance between codewords to be evaluated in flowspace, without a need for base calls to be made.
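The relationship between minimum distance and error-correcting capability described above can be sketched in a few lines of Python. This is only an illustration; the function names are hypothetical, and the three example strings are flows 9-21 of the flowspace vectors of barcodes 1, 2, and 4 listed in Table 1.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length flowspace strings differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords):
    """Smallest pairwise Hamming distance within a set of codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def correctable_errors(d):
    """A code with minimum distance d can correct up to floor((d - 1) / 2) errors."""
    return (d - 1) // 2


# Flows 9-21 (data and parity digits) of barcodes 1, 2, and 4 in Table 1.
EXAMPLE = ["1210001000121", "0121001000211", "1110201000210"]

if __name__ == "__main__":
    d = minimum_distance(EXAMPLE)
    print("minimum distance:", d, "correctable errors:", correctable_errors(d))
```

Because the comparison operates directly on flowspace digits, no base calls are needed to evaluate the distance.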

In various embodiments, sample discriminating codes or barcodes may be designed to have an error-correcting code with a minimum distance of five that is capable of correcting up to two digit errors in the codewords (such as, e.g., an erroneous “0” or “0-mer” present where a “1” or “1-mer” should be at a first location and, e.g., an erroneous “0” or “0-mer” present where a “2” or “2-mer” should be at a second location in the flowspace representation). In other words, each codeword may have a Hamming distance of at least five from all other codewords. In another embodiment, the error-correcting code may have a minimum distance of three and be capable of correcting a single digit error in the codewords. In other words, each codeword may have a Hamming distance of at least three from all other codewords. In an embodiment, an algorithm or methodology may be used that iteratively compares each codeword in a candidate group to all others in the group to construct a largest codeword set that maintains the desired minimum distance and has the desired error-correcting capability.
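One simple way to approximate the iterative comparison described above is a greedy pass over the candidates. The sketch below is an assumption for illustration, not the specific methodology of the embodiment: it keeps a candidate only if it is far enough from every codeword already kept, and it does not guarantee that the resulting set is the largest possible one. The function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_select(candidates, min_distance):
    """Keep each candidate that is at least min_distance from all kept codewords.

    The result depends on candidate order and may be smaller than the
    largest achievable set.
    """
    selected = []
    for cand in candidates:
        if all(hamming_distance(cand, kept) >= min_distance for kept in selected):
            selected.append(cand)
    return selected


if __name__ == "__main__":
    print(greedy_select(["0000", "0001", "0111", "1110", "2222"], 3))
```

Running the pass over several orderings of the candidate group and keeping the largest result is one inexpensive way to improve on a single greedy run.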

In various embodiments, sample discriminating codes or barcodes may be designed to distinguish reads in flowspace rather than base space, which may be particularly useful in sequencing-by-synthesis and may help avoid an excessive number of flows and thereby reduce error build-up and wasted/diminished sequencing capacity. The Hamming distance, for example, may not be desirable in base space. For example, a single base insertion at the beginning of a sequence (e.g., AACGT vs. ACGT) would yield a Hamming distance of 3 (based on the first four bases, to have equal length) despite an in/del distance of only 1. Further, when translating a binary code into 4 letters by paired bits, errors automatically affect two bits simultaneously, and an error correction of 1 bit is not guaranteed to correct 1 error in base reading. Furthermore, conventional barcode designs may not appropriately address sequencing error motifs. In an embodiment, codewords may be selected for useful biological properties.

In various embodiments, sample discriminating codes or barcodes may be designed for one-error or two-error correction in flowspace to correct flowspace errors such as, e.g., insertions and deletions. The barcodes may collectively be designed to be tolerant of single errors in flowspace. A subset of the barcodes may be designed to correct a single flowspace error. The barcodes may further be divided into subsets that, when used alone, may correct for multiple flowspace errors (e.g., two or more errors). This may allow for a set of 96 barcodes or more, for example, that can correct two flowspace errors. The barcode set may be generated using a ternary coding scheme in flowspace (e.g., in flowspace, the barcodes may be viewed as having either 0, 1, or 2 incorporation(s) in a given flow).

In various embodiments, sample discriminating codes or barcodes may be designed around a Hamming ternary code mapped into a particular predetermined flow ordering. For example, such a code may be a [n=13, k=10, d=3] Hamming ternary code; and such a mapping may take the first 10 “trits” (e.g., symbols of the ternary code, such as 0, 1, and 2, for example) and assign them to some of the flows in the predetermined ordering (e.g., flows 9-18), and take three “parity check” trits and assign them to other flows (e.g., 19-21). A final synchronization flow may then be a 1-mer (e.g., a ‘C’ at flow 22) to result in the flows terminating the codeword being zero if they are specified to be zero. Some of the codewords generated under the Hamming codes may not be permissible flowspace representations (e.g., they may be valid mathematical codewords in flowspace that do not correspond to a possible nucleic acid sequence in base space given the predetermined flow ordering). These codewords may be filtered out. In some embodiments, the codewords may be further filtered to include only codewords composed of a desired length (9-15 bases, for example).
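For illustration, a [13, 10, 3] ternary Hamming code can be built from a parity-check matrix whose 13 columns are the nonzero vectors of GF(3)^3 with first nonzero entry equal to 1. The Python sketch below is one standard textbook construction and is an assumption here; the particular parity-check matrix, digit ordering, and function names are hypothetical and need not match those used to generate the barcodes described in this document.

```python
from itertools import product


def normalized_columns():
    """The 13 nonzero vectors of GF(3)^3 whose first nonzero entry is 1."""
    cols = []
    for v in product(range(3), repeat=3):
        nonzero = [x for x in v if x]
        if nonzero and nonzero[0] == 1:
            cols.append(v)
    return cols


UNIT = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]                        # parity columns
DATA_COLS = [c for c in normalized_columns() if c not in UNIT]  # 10 data columns


def encode(data):
    """Append three parity trits to ten data trits so that the syndrome is zero."""
    if len(data) != 10:
        raise ValueError("expected ten data trits")
    parity = tuple(
        (-sum(d * col[r] for d, col in zip(data, DATA_COLS))) % 3 for r in range(3)
    )
    return tuple(data) + parity


def syndrome(word):
    """Parity-check product of a 13-trit word; all zeros for a valid codeword."""
    cols = DATA_COLS + UNIT
    return tuple(sum(w * col[r] for w, col in zip(word, cols)) % 3 for r in range(3))


if __name__ == "__main__":
    cw = encode((1, 2, 1, 0, 0, 0, 1, 0, 0, 0))
    print(cw, syndrome(cw))
```

In a mapping such as the one described above, the ten data trits returned by `encode` would be assigned to flows 9-18 and the three parity trits to flows 19-21; codewords that are not permissible under the flow ordering would still need to be filtered out afterwards.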

In various embodiments, sample discriminating codes or barcodes may be designed around an [n=11, k=6, d=5] ternary Golay code using values 0, 1, and 2. Such a code has 729 (i.e., 3⁶) distinct codewords of length 11 with a distance of 5 between the codewords to correct 2 errors. The codewords may be generated linearly or cyclically or using any suitable methods. As with the Hamming code, some of the codewords generated under the ternary Golay code may not correspond to a possible nucleic acid sequence in base space given the predetermined flow ordering and can thus be filtered out.
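As a sketch of the cyclic route to generating such codewords, the following Python snippet multiplies every message polynomial of degree below 6 by a degree-5 generator polynomial over GF(3). The generator g(x) = x^5 + x^4 + 2x^3 + x^2 + 2 is a commonly cited choice for the ternary Golay code and is an assumption here; the source does not state which generator or equivalent form of the code was used, and the function names are hypothetical.

```python
from itertools import product

# g(x) = 2 + x^2 + 2x^3 + x^4 + x^5 over GF(3), lowest-degree coefficient first.
# Assumed generator polynomial; other equivalent forms of the code exist.
G_POLY = (2, 0, 1, 2, 1, 1)


def golay_codewords():
    """All 3**6 codewords of the cyclic [11, 6] code generated by G_POLY."""
    words = []
    for msg in product(range(3), repeat=6):
        cw = [0] * 11
        for i, m in enumerate(msg):
            for j, g in enumerate(G_POLY):
                cw[i + j] = (cw[i + j] + m * g) % 3
        words.append(tuple(cw))
    return words


def minimum_weight(words):
    """Minimum Hamming weight of the nonzero codewords (equals d for a linear code)."""
    return min(sum(1 for t in w if t) for w in words if any(w))


if __name__ == "__main__":
    code = golay_codewords()
    print(len(set(code)), minimum_weight(code))
```

Because the code is linear, checking the minimum weight of the nonzero codewords is sufficient to confirm the minimum distance, which avoids comparing all pairs.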

FIG. 7A illustrates an exemplary translation to flowspace of 36 9-mer ternary Golay codewords permissible under a predetermined flow ordering. The x-axis corresponds to the first 21 flows of a predetermined flow ordering (in this case, TACG, followed by TACG, followed by TCTG, followed by AGCA, followed by TCGA, followed by T (with the remaining flows in that ordering being CGA, followed by TGTA, followed by CAGC)). The y-axis corresponds to homopolymer responses to these flows, reflecting 36 permissible ternary Golay codewords. The color/gray coding indicates the code symbols (black is 0 for a 0-mer, red or light gray is 1 for a 1-mer, and blue or dark gray is 2 for a 2-mer). For example, the sequence of 1-mer(s) and 2-mer(s) in the first line (i.e., bottom row) corresponds to T (1-mer responsive to flow 1 of T), followed by C (1-mer responsive to flow 3 of C), followed by A (1-mer responsive to flow 6 of A), followed by G (1-mer responsive to flow 8 of G), followed by T (1-mer responsive to flow 9 of T), followed by CC (2-mer responsive to flow 10 of C), followed by TT (2-mer responsive to flow 11 of T), followed by G (1-mer responsive to flow 12 of G), followed by AA (2-mer responsive to flow 13 of A), followed by T (1-mer responsive to flow 17 of T), followed by C (1-mer responsive to flow 18 of C), followed by T (1-mer responsive to flow 21 of T). In each line, there are a 5-base key TCAGT (corresponding to flows 1-9), 11 flows of code (a permissible Golay codeword in each line in flows), and 1 flow to resynchronize (flow 21 of T). One may use the existing 11-base codewords in the ternary Golay code to fill in flowspace. Every line differs from any other line in at least five flows. As mentioned previously, there are 729 (i.e., 3⁶) Golay codewords of length 11 in flowspace (e.g., sequences of eleven characters selected from 0, 1, or 2) and, as illustrated in FIGS. 6A and 6B, a flowspace representation may be mapped into a base sequence given a predetermined flow ordering. In base space, there are 262,144 possible sequences of nine bases selected from A, C, G, or T (i.e., 4⁹). However, not every one of the possible Golay codewords in flowspace corresponds to one of the possible sequences in base space under a given predetermined flow ordering. Codewords that do not correspond to a permissible sequence may be filtered out. A computer analysis and/or simulation may be performed to determine which flowspace codewords, for a given predetermined flow ordering, correspond to a possible sequence of bases in base space.
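One way such a computer analysis might be carried out is a round-trip check: translate the flowspace vector to bases, project those bases back into flowspace under the same flow ordering, and keep the vector only if it is reproduced exactly. The Python sketch below assumes the 21-flow ordering and the key and resynchronization flows described for FIG. 7A; the function names are hypothetical.

```python
# First 21 flows of the ordering described for FIG. 7A.
FLOW_ORDER = "TACG" "TACG" "TCTG" "AGCA" "TCGA" "T"
KEY_FLOWS = "101001011"   # 5-base key TCAGT occupying flows 1-9
SYNC_FLOW = "1"           # 1-mer T at flow 21


def flows_to_bases(vector, flow_order):
    """Translate a flowspace string of digits into a base sequence."""
    return "".join(nuc * int(n) for nuc, n in zip(flow_order, vector))


def bases_to_flows(seq, flow_order):
    """Project a base sequence into flowspace under the given flow ordering."""
    out, pos = [], 0
    for nuc in flow_order:
        run = 0
        while pos < len(seq) and seq[pos] == nuc:
            run += 1
            pos += 1
        out.append(str(run))
    return "".join(out)


def is_permissible(code_flows, flow_order=FLOW_ORDER):
    """True if key + code + sync survives a flowspace -> bases -> flowspace round trip."""
    vector = KEY_FLOWS + code_flows + SYNC_FLOW
    return bases_to_flows(flows_to_bases(vector, flow_order), flow_order) == vector


if __name__ == "__main__":
    # Code flows 10-20 of the bottom row described for FIG. 7A.
    print(flows_to_bases(KEY_FLOWS + "22120001100" + SYNC_FLOW, FLOW_ORDER))
    print(is_permissible("22120001100"), is_permissible("00000000000"))
```

The all-zero code is rejected in this example because the key ends in T at flow 9 and the resynchronization base is also T, so the two bases would have incorporated together as a 2-mer at flow 9.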

FIG. 7B illustrates an exemplary translation to flowspace of 332 ternary Golay codewords permissible under the same predetermined flow ordering as in FIG. 7A. The graph is essentially the same as in FIG. 7A, except that the permissible Golay codewords are not limited to 9-mer ones. In other words, of the 729 possible [11, 6, 5] codewords, which have length 11 in flowspace (a property of this particular code), these are the codewords that correspond to a base space sequence that is possible (e.g., that could actually occur) under the predetermined flow ordering of FIG. 7A. Of course, this particular set of codewords is just an example, and different ones would arise for that ternary Golay code with a different predetermined flow ordering.

FIG. 7C illustrates exemplary Golay flowspace barcode weights for the codewords represented in FIG. 7A. The x-axis shows the number of bases in each codeword. The y-axis shows the corresponding frequency. For example, there are 36 9-mers, 33 10-mers, 57 11-mers, etc. The codewords are separable under the predetermined flow ordering in FIGS. 7A and 7B, and no particular Hamming distance in base space was applied as a constraint (in other words, that distance in base space need only be at least 1).

In various embodiments, sample discriminating codes or barcodes may be designed around a tetracode code, in which codewords have length 4. There are 8 nonzero codewords (which may take values 0, 1, or 2), and they can correct 1 error (and detect 2 errors).
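The tetracode is small enough to enumerate directly. The Python sketch below uses one common generator matrix for this [4, 2, 3] ternary code, which is an assumption here (equivalent forms exist, and the source does not specify one); the names are hypothetical.

```python
from itertools import combinations, product

# One common generator matrix of the ternary tetracode (assumed form).
GENERATOR = [(1, 0, 1, 1), (0, 1, 1, 2)]


def tetracode():
    """All 9 codewords a*row0 + b*row1 over GF(3), including the zero word."""
    return [
        tuple((a * g0 + b * g1) % 3 for g0, g1 in zip(*GENERATOR))
        for a, b in product(range(3), repeat=2)
    ]


def hamming_distance(u, v):
    """Number of positions at which two codewords differ."""
    return sum(x != y for x, y in zip(u, v))


if __name__ == "__main__":
    words = tetracode()
    nonzero = [w for w in words if any(w)]
    d = min(hamming_distance(u, v) for u, v in combinations(words, 2))
    print(len(nonzero), "nonzero codewords, minimum distance", d)
```

With a minimum distance of 3, (3 − 1)/2 = 1 error can be corrected, or up to 2 errors can be detected, consistent with the description above.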

FIG. 8A illustrates 8 nonzero ternary tetracode codewords. The x-axis shows the position in the codeword. The y-axis shows each codeword. Color/gray coding indicates the code symbols (black is 0 for a 0-mer, red or light gray is 1 for a 1-mer, and blue or dark gray is 2 for a 2-mer). Such codewords are always permissible in a (repeating) 4-base flow order XYZW.

FIG. 8B illustrates 288 nonzero ternary concatenated tetracode codewords permissible under the same predetermined flow ordering as in FIG. 7A. The x-axis shows the flow number. The y-axis shows each codeword. In each line, there are a 5-base key TCAGT (corresponding to flows 1-9), 12 flows of code (3 concatenated tetracode codewords), and 1 flow to resynchronize (flow 22 of C). Here, not all flow sequences are permissible because of interactions.

According to an embodiment, there is provided a method for generating barcodes, comprising: designing codewords based on a predetermined flow order and using a ternary code, wherein the codewords (1) are expressed in flowspace; (2) include 0-mer(s), 1-mer(s), or 2-mer(s) only; (3) can correct 2 flowspace errors; and (4) are permissible flowspace sequences that correspond to base space sequences that are possible given the predetermined flow order. The ternary code may be a Hamming code, a Golay code, or a tetracode code, for example. The barcodes may be compact in flowspace (e.g., post-key all barcodes may be designed to end at flow 21, with a length 11 ternary code, and a 5-base key); and may have a buffer base at flow 21.

Barcodes as described herein may be particularly useful for sequencing-by-synthesis methods that operate by unterminated single-nucleotide addition, in which the precursor nucleotides are repeatedly added individually to the reaction in series according to a predetermined ordering (such as, e.g., methods based on hydrogen ions produced by the incorporation reactions or on the detection of inorganic pyrophosphate, which may be detected by light emitted from an enzyme cascade initiated by the inorganic pyrophosphate). In such methods, errors of insertions and/or deletions may arise due to inaccurate base-calling in flowspace. For example, in the sequence ACCGT, the C base 2-mer may be incorrectly called as a 1-mer in flowspace, resulting in the omission (deletion) of a C base in the sequence read (in other words, that sequence could be incorrectly read as ACGT). Barcodes as described herein may be particularly useful for detecting and/or correcting such read errors.
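The ACCGT example above can be traced in flowspace with a short Python sketch. The five-flow ordering T, A, C, G, T used here is an assumption chosen only to keep the illustration small, and the function name is hypothetical.

```python
FLOW_ORDER = "TACGT"  # assumed short ordering for illustration only


def flows_to_bases(vector, flow_order):
    """Translate a flowspace string of digits into a base sequence."""
    return "".join(nuc * int(n) for nuc, n in zip(flow_order, vector))


true_flows = "01211"    # ACCGT: no T, one A, two C, one G, one T
called_flows = "01111"  # the C 2-mer miscalled as a 1-mer

if __name__ == "__main__":
    print(flows_to_bases(true_flows, FLOW_ORDER))    # sequence actually present
    print(flows_to_bases(called_flows, FLOW_ORDER))  # sequence read with the miscall
```

A single wrong digit in flowspace therefore appears as a deletion in base space, which is why the error-tolerant properties are expressed in flowspace.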

Various algorithms and/or software tools may be used to assist in the generation of error-correcting codes. A number of different design considerations may factor into the development of a coding strategy. As explained above, a barcode sequence and a flowspace codeword have a mapping relationship to each other for a given flow ordering. Thus, design or selection criteria with respect to the barcode sequence may be translated into corresponding design/selection criteria for the flowspace coding. Likewise, design/selection criteria with respect to the flowspace coding may be translated into corresponding design/selection criteria for the barcode sequence.

In various embodiments, the size of an error-correcting code (e.g., the number of codewords) may be varied as needed. A larger code is generally more constrained and difficult to construct, but may accommodate a larger number of multiplex samples. In an embodiment, the error-correcting code may contain at least 16 different codewords; or at least 32 different codewords; or at least 48 different codewords; or at least 96 different codewords; or more, for example. Because codewords and barcode sequences are related by flow mapping, these embodiments may translate to at least 16 different barcode sequences; or at least 32 different barcode sequences; or at least 48 different barcode sequences; or at least 96 different barcode sequences; or more, for example. Generally, the size of a code becomes smaller as the minimum distance of the code is increased, and such a code generally may accommodate fewer multiplex samples but with a better error-correcting capability. An appropriate balance between the error-correcting capability of the code and the size of the code may be considered in the code design.

In various embodiments, the length of flowspace codewords may be varied as needed. For example, if a larger code set (e.g., with more codewords) is needed, the length of the codewords may be increased. In an embodiment, the flowspace codewords may have a length of 15 or fewer digits, for example. In some embodiments, the flowspace codewords may have a length of four, which can be constructed of eight non-zero codewords that can detect two errors or correct one error. Other factors that may be considered include a desired error-correcting capability (e.g., the number of errors that the code can correct) and a decoding complexity (e.g., the computational time needed to decode each codeword). Also, flowspace codewords that do not correspond to any valid sequence under the predetermined flow ordering (e.g., in a cyclical four-nucleotide flow ordering, codewords having four 0's in a row) may be excluded.

In various embodiments, the length of the barcode sequences (or number of nucleotide bases) may be varied as needed. In an embodiment, the length may be limited to improve sequencing efficiency. In various embodiments, the barcode sequences may have a length of 20 or fewer bases; or in the range of 5-15 bases, for example. In various embodiments, the length of homopolymer runs in the barcodes may be limited to be compatible with the flowspace coding scheme. In some embodiments, the barcode sequences may contain only 1-mers (that is, no repeating bases), and the error-correcting code may then be a binary code. In some embodiments, the maximum length of homopolymer runs may be 3-mers or 2-mers, and the error-correcting code may then be a quaternary code or a ternary code, respectively.

In various embodiments, one or more molecular biology properties of nucleotide base sequences may be considered in the design or selection of barcode sequences. For example, the barcode sequences may be designed or selected to avoid certain nucleotide sequences known to cause sequencing read errors or result in sequencing bias. This can enhance PCR and/or sequencing performance. In some embodiments, because the GC (guanine/cytosine) content of a sequence can affect sequencing quality, barcode sequences having a GC content in the range of 40-60% may be preferred and barcodes outside this range may be filtered out. The AT content may also be similarly treated. In some cases, barcode sequences that are self-complementary or complementary with a primer sequence that is coupled to the barcode may be excluded.
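Filters of this kind are straightforward to apply in software. The Python sketch below is a minimal illustration with hypothetical function names; it treats "self-complementary" as meaning that a sequence equals its own reverse complement, which is an assumption, and it does not model primer complementarity.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def is_self_complementary(seq):
    """True if seq equals its own reverse complement (assumed definition)."""
    return seq == "".join(COMPLEMENT[b] for b in reversed(seq))


def passes_filters(seq, lo=0.40, hi=0.60):
    """Keep barcodes with GC content in [lo, hi] that are not self-complementary."""
    return lo <= gc_fraction(seq) <= hi and not is_self_complementary(seq)


if __name__ == "__main__":
    for barcode in ["ACGTACGTAC", "AATTAATTAA", "ACGT"]:
        print(barcode, passes_filters(barcode))
```

Because a barcode sequence and its flowspace codeword map to each other for a given flow ordering, the same filter can be applied to a list of candidate codewords after translating each one to bases.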

An exemplary set of nucleic acid barcodes may be found in Appendix I of U.S. Prov. Pat. Appl. No. 61/529,687, filed Aug. 31, 2011, which is incorporated by reference herein in its entirety. Such nucleic acid barcodes (including various combinations, sub-combinations, and portions thereof) may be used, for example, in a multiplexed sequencing experiment with different samples.

In an embodiment, for a multiplex sequencing problem requiring a set of 96 barcodes that can correct two errors in the flowspace string, with some accommodation for potential loss due to poorly behaving barcodes, and a predetermined flow ordering of TACG, followed by TACG, followed by TCTG, followed by AGCA, followed by TCGA, followed by TCGA, followed by TGTA, followed by CAGC, for example, a set of barcode sequences may be generated using a ternary Hamming code of 13-digit length, with ten of the digits being treated as data and three of the digits being treated as parity checks in the codeword. This particular coding scheme yields about 140 codewords that can correct up to two errors. A representative sampling of this particular coding scheme is shown in Table 1 below. (A complete list of the barcodes and flowspace codewords generated can be seen in Appendix I of U.S. Prov. Pat. Appl. No. 61/529,687, filed Aug. 31, 2011, which is incorporated by reference herein in its entirety.) In this example, the barcodes were selected for those having 9-11 bases in length and were designed for use in oligonucleotides for multiplex sequencing on an Ion PGM™ sequencing instrument. The oligonucleotides for this example contained, in the following order, a primer site, a TCAG key sequence for quality control and sample detection, a unique barcode sequence, a common C base at the 3′ end of the barcode sequences for synchronization to ensure that flows terminating the codeword are zero if they are specified to be zero, and a GAT buffer between the barcode and the insert to minimize the influence of the variable barcode region on ligation of the adapter. This GAT buffer is the same last three bases as the P1 adapter used for the Ion PGM™ sequencing instrument. The information in Table 1 is organized according to the serial number of the barcodes that were generated. The second column shows the key sequence, barcode sequences, and the common C base. The third column shows the barcode sequences and the common C base. The fourth column shows the projection of the combined sequence elements into flowspace. In the table, the bases and flowspace vector elements corresponding to the barcodes are all indicated in bold. In the flowspace mapping, flows 1-8 were assigned to the key sequence, flows 9-18 were assigned to the data digits of the barcode, and flows 19-21 were assigned to the parity digits. Because all the barcodes were followed immediately by a common C base, flow 22 was provided for synchronization. Naturally, such keys, synchronization bases, and/or buffers could be varied.

TABLE 1. Exemplary barcodes and projections into flowspace.

Serial No.   Key + Barcode + C                 Barcode + C                    Flowspace Vector
 1           TCAGTCCTCGAATC (SEQ ID NO: 1)     TCCTCGAATC (SEQ ID NO: 2)      1010010112100010001211
 2           TCAGCTTGCGGATC (SEQ ID NO: 3)     CTTGCGGATC (SEQ ID NO: 4)      1010010101210010002111
 4           TCAGTCTAACGGAC (SEQ ID NO: 5)     TCTAACGGAC (SEQ ID NO: 6)      1010010111102010002101
 5           TCAGTTCTTAGCGC (SEQ ID NO: 7)     TTCTTAGCGC (SEQ ID NO: 8)      1010010121201110001001
 6           TCAGTGAGCGGAAC (SEQ ID NO: 9)     TGAGCGGAAC (SEQ ID NO: 10)     1010010110011110002201
 7           TCAGTTAAGCGGTC (SEQ ID NO: 11)    TTAAGCGGTC (SEQ ID NO: 12)     1010010120002110002011
 9           TCAGCTGACCGAAC (SEQ ID NO: 13)    CTGACCGAAC (SEQ ID NO: 14)     1010010101111020001201
11           TCAGTCTAGAGGTC (SEQ ID NO: 15)    TCTAGAGGTC (SEQ ID NO: 16)     1010010111101101002011
12           TCAGAAGAGGATTC (SEQ ID NO: 17)    AAGAGGATTC (SEQ ID NO: 18)     1010010100002101002121
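The flowspace vectors in Table 1 can be reproduced by projecting each key + barcode + C sequence into flowspace under the flow ordering given above. The following Python sketch shows one way to do so; the function name and its error handling are hypothetical choices made for illustration.

```python
# Flow ordering given for this example: TACG, TACG, TCTG, AGCA, TCGA, TCGA, TGTA, CAGC.
FLOW_ORDER = "TACG" "TACG" "TCTG" "AGCA" "TCGA" "TCGA" "TGTA" "CAGC"


def project_to_flowspace(seq, flow_order, n_flows):
    """Return the homopolymer length incorporated at each of the first n_flows flows."""
    vector, pos = [], 0
    for nuc in flow_order[:n_flows]:
        run = 0
        while pos < len(seq) and seq[pos] == nuc:
            run += 1
            pos += 1
        vector.append(run)
    if pos != len(seq):
        raise ValueError("sequence was not fully read within the given flows")
    return "".join(str(n) for n in vector)


if __name__ == "__main__":
    # Key + barcode + C for serial numbers 1, 2, and 12 of Table 1.
    for seq in ["TCAGTCCTCGAATC", "TCAGCTTGCGGATC", "TCAGAAGAGGATTC"]:
        print(seq, project_to_flowspace(seq, FLOW_ORDER, 22))
```

In each resulting 22-digit vector, the first eight digits correspond to the TCAG key, the next ten to the data digits, the following three to the parity digits, and the last to the common C base.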

According to an embodiment, there is provided a set of barcodes for nucleic acid sequencing, comprising: a plurality of nucleic acid base sequences, each nucleic acid base sequence being a base space representation of a flowspace representation of a codeword of an error-tolerant sample discrimination code. The error-tolerant sample discrimination code may be tolerant to at least two flowspace errors. The error-tolerant sample discrimination code may be an error-correcting code, which may be a ternary Hamming code, or a ternary Golay code, or a ternary tetracode code. The error-tolerant sample discrimination code may be used in various methods to detect flowspace errors (e.g., detect that a particular digit, e.g., “1,” for a given flow should be “2”) using any suitable error detection algorithm. Although detection may sometimes suffice and correction is not necessarily required, the error-tolerant sample discrimination code may also be used in various methods to correct such detected errors using any suitable error correction algorithm. Such a set of barcodes may be used to distinguish samples based on a flowspace vector obtained during sequencing (e.g., using sequencing-by-synthesis) before making any actual base calls, which may be useful to perform highly rapid and efficient preliminary sample discrimination.

Any suitable decoding algorithms and/or software tools may be used for decoding the flowspace strings read from the barcode sequences to correct and/or detect errors. For example, the decoding can be performed using an exhaustion algorithm in which a damaged codeword is compared to all other members of the code and decoded as the closest matching codeword. If the damaged codeword is equally close to two codewords or is further than half the minimum distance from any codeword, then the algorithm may indicate that an error is detected without making any corrections. In another example, the decoding may involve performing the coding operation in reverse. In another example, the decoding algorithm may use linear algebra techniques to decode the codeword.
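The exhaustion approach described above might be sketched as follows in Python. The function names and the return convention (a pair of decoded codeword and distance, with None when only detection is possible) are assumptions made for illustration; the three example codewords are flows 9-21 of barcodes 1, 2, and 4 in Table 1, which are pairwise at least 5 apart.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def decode_exhaustive(received, codewords, min_distance):
    """Compare received to every codeword and return (closest codeword, distance).

    Returns (None, distance) when the closest match is tied or lies further
    than floor((min_distance - 1) / 2) away, i.e. an error is detected but
    no correction is attempted.
    """
    ranked = sorted((hamming_distance(received, cw), cw) for cw in codewords)
    best_dist, best = ranked[0]
    tied = len(ranked) > 1 and ranked[1][0] == best_dist
    if tied or best_dist > (min_distance - 1) // 2:
        return None, best_dist
    return best, best_dist


CODEWORDS = ["1210001000121", "0121001000211", "1110201000210"]

if __name__ == "__main__":
    # First codeword with two digit errors (positions 1 and 12).
    print(decode_exhaustive("0210001000111", CODEWORDS, 5))
    # A string far from every codeword: detected, not corrected.
    print(decode_exhaustive("2222222222222", CODEWORDS, 5))
```

Exhaustive comparison scales linearly with the number of codewords, which is inexpensive for barcode sets of the sizes discussed here (for example, 96 codewords).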

According to an embodiment, there is provided a method for sequencing a polynucleotide sample having a barcode sequence, comprising: introducing a series of nucleotides to the polynucleotide sample according to a predetermined flow ordering; obtaining a series of signals resulting from the introducing of nucleotides to the polynucleotide sample; and resolving the series of signals over the barcode sequence to render a flowspace string, wherein the flowspace string is a codeword of an error-tolerant code capable of distinguishing the barcode sequence from other barcode sequences in the presence of one or more errors. In such a method, the error-tolerant code may be an error-correcting code. The error-correcting code may be a ternary code using a three-character alphabet. The first character of the alphabet may represent a 0-mer signal, the second character in the alphabet may represent a 1-mer signal, and the third character in the alphabet may represent a 2-mer signal. The ternary code may be a ternary Hamming code. The ternary code may be a ternary Golay code. The codeword may have a length of 15 or fewer characters. The error-correcting code may be capable of correcting up to two errors in the flowspace string. The error-correcting code may be capable of correcting one error in the flowspace string. The barcode sequence may have a length in the range of 5-15 bases. The barcode sequence may have a GC content in the range of 40-60%. The method may further comprise: determining whether the flowspace string contains an error; and correcting the flowspace string if an error is detected. The method may further comprise translating the flowspace string into a base sequence to obtain the sequence of the barcode sequence. The method may further comprise: decoding the flowspace string to the correct flowspace codeword; and translating the flowspace codeword into a base sequence to obtain the sequence of the barcode sequence. The method may further comprise providing multiple different polynucleotide samples for multiplex sequencing by the series of flowed nucleotides, each different polynucleotide sample comprising a different barcode sequence that gives a different flowspace string, each different flowspace string being a different codeword of the error-correcting code. Any two flowspace codewords of the error-correcting code may have a Hamming distance of at least 3. Any two flowspace codewords of the error-correcting code may have a Hamming distance of at least 5. The method may further comprise one or both of: (i) detecting hydrogen ions released by the incorporation of nucleotides into the polynucleotide sample, wherein the amplitude of the signals is related to the amount of hydrogen ions detected, and (ii) detecting inorganic pyrophosphate released by the incorporation of nucleotides into the polynucleotide sample, wherein the amplitude of the signals is related to the amount of inorganic pyrophosphate detected. There is also provided a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform such a method and variants thereof. There is also provided a system, including: a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform such a method and variants thereof.

According to an embodiment, there is provided a method of making a polynucleotide, comprising: constructing an error-tolerant code consisting of a set of codewords; translating each codeword to a sample discriminating code sequence for a given nucleotide flow ordering; and obtaining a polynucleotide containing the sample discriminating code sequence. The polynucleotide may be an oligonucleotide that is 5-40 bases in length. The step of obtaining the polynucleotide may comprise synthesizing the polynucleotide. There is also provided a polynucleotide made according to the method comprising: constructing an error-tolerant code consisting of a set of codewords; translating each codeword to a sample discriminating code sequence for a given nucleotide flow ordering; and obtaining a polynucleotide containing the sample discriminating code sequence. The sample discriminating code sequence may be a barcode sequence. The error-tolerant code may be an error-correcting code. The codewords may be translated to barcode sequences for a given nucleotide flow ordering. The error-correcting code, the codewords, and/or the barcode sequences may be designed or selected in any manner described herein. A polynucleotide containing the barcode sequence may be made using any conventional polynucleotide synthesis technique known in the art or can be supplied by commercial sources that provide custom-made polynucleotides of the desired sequence.

According to an exemplary embodiment, there is provided a method for generating sample discriminating code sequences, comprising: constructing an error-tolerant code consisting of a set of codewords; and translating each codeword to a sample discriminating code sequence for a given nucleotide flow ordering. The error-tolerant code may be an error-correcting code. The codewords may be translated to barcode sequences for a given nucleotide flow ordering. The error-correcting code, the codewords, and/or the barcode sequences may be designed or selected in any manner described herein. There is also provided a system for generating sample discriminating code sequences, comprising: a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for generating sample discriminating code sequences, comprising: constructing an error-tolerant code consisting of a set of codewords; receiving a predetermined nucleotide flow ordering; and translating each codeword to a sample discriminating code sequence according to the predetermined nucleotide flow ordering. There is also provided a non-transitory machine-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a method for generating sample discriminating code sequences, comprising: constructing an error-tolerant code consisting of a set of codewords; receiving a predetermined nucleotide flow ordering; and translating each codeword to a sample discriminating code sequence according to the predetermined nucleotide flow ordering. The sample discriminating code sequences may be barcode sequences.

According to an exemplary embodiment, there is provided a sequencing kit, comprising: a plurality of different polynucleotides, each different polynucleotide comprising a different molecular sample discriminating code sequence, wherein a flowspace projection of each different molecular sample discriminating code sequence under a flow of a series of nucleotides according to a predetermined flow ordering gives different flowspace strings that are codewords of an error-tolerant code. The plurality of different polynucleotides may include at least 20, 30, 40, 50, 60, 70, 80, 90, 100, or a larger integer number of different polynucleotides. The different molecular sample discriminating code sequences may be different molecular barcode sequences. The sequencing kit may further comprise a polymerase enzyme. The sequencing kit may further comprise multiple containers for holding the different polynucleotides, and each different polynucleotide may be held in a different container. The polynucleotides may be oligonucleotides of 5-40 bases in length. The sequencing kit may further comprise multiple different kinds of nucleotide monomers. The sequencing kit may further comprise a ligase enzyme.

According to an exemplary embodiment, there is provided a sequencing kit, comprising: multiple different polynucleotides (which may be contained in vials, for example), each different polynucleotide comprising a different barcode sequence as described herein. The polynucleotides may be oligonucleotides having 5-40 bases. The sequencing kit may contain at least 16 different polynucleotides, each comprising a different barcode sequence; or at least 32 different polynucleotides; or at least 48 different polynucleotides; or at least 96 different polynucleotides, or more, for example. Each different polynucleotide may be provided in a separate container (e.g., a separate vial). The polynucleotides may be the barcode sequences themselves, or they may further include other elements, such as primer sites, adaptors, ligating sites, linkers, etc. The sequencing kit may also include a set of precursor nucleotide monomers for carrying out sequencing-by-synthesis operations, for example, and/or various other reagents involved in a workflow for preparing and/or sequencing a sample.

According to an exemplary embodiment, there is provided a pool of different polynucleotide strands, each different polynucleotide strand comprising a different barcode sequence; wherein the flowspace projection of each different barcode sequence under a flow of a series of nucleotides according to a predetermined flow ordering gives different flowspace strings that are codewords of an error-tolerant code. The error-tolerant code may be an error-correcting code. The error-correcting code may be a block code. The block code may be a ternary code using a three-character alphabet. The barcode sequence may contain only 1-mer and 2-mer base sequences, and a first character of the alphabet may represent a 0-mer signal, a second character in the alphabet may represent a 1-mer signal, and a third character in the alphabet may represent a 2-mer signal. The block code may be a ternary Golay code. The block code may be a ternary Hamming code. The block code may be a binary code using a two-character alphabet. The barcode sequence may contain only 1-mer bases, and a first character of the alphabet may represent a 0-mer signal, and a second character in the alphabet may represent a 1-mer signal. Each codeword may have a length of 15, 14, 13, 12, 11, 10, or fewer digits. The error-correcting code may be configured to and/or capable of correcting up to two errors in the flowspace string. The error-correcting code may be configured to and/or capable of correcting one error in the flowspace string. The barcode sequences may have a length of 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, or fewer bases. The barcode sequences may have a GC content in the range of 40-60%. The error-correcting code may have a minimum distance of 3. The error-correcting code may have a minimum distance of 5. The polynucleotide pool may be such that any two different flowspace strings for the barcode sequences have a Hamming distance of at least 3. The polynucleotide pool may be such that any two different flowspace strings for the barcode sequences have a Hamming distance of at least 5. The polynucleotide pool may contain at least 96 different polynucleotide strands and the error-correcting code may have at least 96 different codewords.
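
The distance and GC-content properties of such a pool can be checked with a short Python sketch. It relies on the standard coding-theory relation that a code with minimum Hamming distance d can correct up to floor((d - 1) / 2) errors, which is consistent with a minimum distance of 3 correcting one error and a minimum distance of 5 correcting two. The function names and the three example flowspace strings are illustrative assumptions and are not barcodes from the embodiments.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("flowspace strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(flow_strings):
    """Smallest pairwise Hamming distance over a pool of flowspace strings."""
    return min(hamming_distance(a, b) for a, b in combinations(flow_strings, 2))


def correctable_errors(min_distance):
    """Errors guaranteed correctable by nearest-codeword decoding."""
    return (min_distance - 1) // 2


def gc_fraction(barcode):
    """Fraction of G and C bases in a barcode sequence."""
    return sum(base in "GC" for base in barcode) / len(barcode)


if __name__ == "__main__":
    pool = ["0000000", "1110000", "0011100"]  # hypothetical binary flowspace strings
    d = minimum_distance(pool)
    print(d, correctable_errors(d))
    print(0.40 <= gc_fraction("ACGTAC") <= 0.60)
```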

FIG. 9 illustrates a pool of seven different polynucleotide strands, each associated with a unique barcode sequence. Each polynucleotide strand may have a primer site, a standard key sequence, and a unique barcode sequence. Each polynucleotide strand also may have a different target sequence. Such a pool of polynucleotide strands may be subject to multiplex sequencing and the barcodes (with any read corrections as needed) may help identify the source of the sequence data derived from a multiplex sample.
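
One way to picture the read correction mentioned above is nearest-codeword decoding: an observed flowspace string is assigned to the barcode whose codeword is closest in Hamming distance, provided the distance does not exceed the number of errors the code can correct. The following Python sketch is a hedged illustration only; the codebook, sample names, and function name are hypothetical, and the observed string is assumed to have already been reduced to one digit per flow.

```python
def decode_barcode(observed, codebook, max_errors):
    """Return the sample whose codeword is nearest to `observed`.

    `codebook` maps flowspace codewords to sample names (hypothetical).
    Returns None when the nearest codeword is farther than `max_errors`.
    """
    best_sample, best_distance = None, None
    for codeword, sample in codebook.items():
        if len(codeword) != len(observed):
            raise ValueError("observed string and codewords must match in length")
        distance = sum(x != y for x, y in zip(observed, codeword))
        if best_distance is None or distance < best_distance:
            best_sample, best_distance = sample, distance
    if best_distance is not None and best_distance <= max_errors:
        return best_sample
    return None


if __name__ == "__main__":
    codebook = {
        "0000000": "sample_1",
        "1110000": "sample_2",
        "0011100": "sample_3",
    }
    # One flow differs from sample_2's codeword, so the read is corrected.
    print(decode_barcode("1010000", codebook, max_errors=1))
    # Two or more flows differ from every codeword, so the read is left unassigned.
    print(decode_barcode("1100100", codebook, max_errors=1))
```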

According to an exemplary embodiment, there is provided a sample identification kit, comprising: a plurality of sample discriminating codes, wherein: a) each sample discriminating code comprises a sequence of individual subunits; b) the sequence of subunits of each sample discriminating code is distinguishable from the sequence of individual subunits of each other member of the plurality of sample discriminating codes; and c) each sample discriminating code is tolerant to one or more errors so as to be discretely resolvable with respect to other sample discriminating codes.

According to an exemplary embodiment, there is provided a sample identification kit, comprising: a plurality of sample discriminating codes, wherein: a) each sample discriminating code comprises a sequence of individual subunits; b) a detectable signal is associated with each subunit or with pairs or sets of subunits such that each sample discriminating code is associated with a sequence of detectable signals; c) each sequence of detectable signals is distinguishable from the sequence of detectable signals of each other member of the plurality of sample discriminating codes; and d) the sequence of detectable signals of each sample discriminating code is tolerant to at least one error so as to be discretely resolvable with respect to other sample discriminating codes.

FIGS. 10A-10C illustrate an exemplary workflow for preparing a multiplex sample. FIG. 10A shows an exemplary construction of a genomic DNA fragment library. A bacterial genomic DNA 10 may be fragmented into many DNA fragments 12 using any suitable technique, such as sonication, mechanical shearing, or enzymatic digestion, for example. Platform-specific adaptors 14 may then be ligated onto the ends of the fragments 12. Referring to FIG. 10B, each fragment sample 18 may then be isolated and combined with a bead 16. To allow for identification of the fragment 18, a barcode sequence (not shown in the figure) may be ligated to the fragment 18. The fragment 18 may then be clonally amplified onto the bead 16, resulting in many clonal copies of the fragment 18 on the bead 16. This process may be repeated for each different fragment 12 of the library, resulting in many beads, each having the product of a single library fragment 12 amplified many times. Referring to FIG. 10C, the beads 16 may then be loaded onto a microwell array. FIG. 10C shows a partial view of a DNA fragment inside a well as it is undergoing sequencing reactions. A template strand 20 may be paired with a growing complementary strand 22. In the left panel, an A nucleotide is added to the microwell, resulting in a single-base incorporation event, which generates a single hydrogen ion. In the right panel, a T nucleotide is added to the microwell, resulting in a two-base incorporation event, which generates two hydrogen ions. The signals produced by the hydrogen ions are shown as peaks 26 in the ionograms. In various embodiments, a sequencing kit may contain one or more of the materials needed for the above sample preparation and sequencing workflow, including reagents for performing DNA fragmentation, adaptors, primers, ligase enzymes, beads or other solid support, polymerase enzymes, or precursor nucleotide monomers for the incorporation reactions.

According to an exemplary embodiment, there is provided a system, comprising a plurality of identifiable nucleic acid barcodes. The nucleic acid barcodes may be attached to, or associated with, target nucleic acid fragments to form barcoded target fragments. A library of barcoded target fragments may include a plurality of a first barcode attached to target fragments from a first source. Alternatively, a library of barcoded target fragments may include different identifiable barcodes attached to target fragments from different sources to make a multiplex library. For example, a multiplex library may include a mixture of a plurality of a first barcode attached to target fragments from a first source, and a plurality of a second barcode attached to target fragments from a second source. In the multiplex library, the first and second barcodes may be used to identify the source of the first and second target fragments, respectively. Any number of different barcodes may be attached to target fragments from any number of different sources. In a library of barcoded target fragments, the barcode portion may be used to identify: a single target fragment; a single source of the target fragments; a group of target fragments; target fragments from a single source; target fragments from different sources; target fragments from a user-defined group; or any other grouping that may require or benefit from identification. The sequence of the barcoded portion of a barcoded target fragment may be separately read from the target fragment, or read as part of a larger read spanning the barcode and the target fragment. In a sequencing experiment, the nucleic acid barcode may be sequenced with the target fragment and then parsed algorithmically during processing of the sequencing data. In various embodiments, a nucleic acid barcode may comprise a synthetic or natural nucleic acid sequence, DNA, RNA, or other nucleic acids and/or derivatives. For example, a nucleic acid barcode may include nucleotide bases adenine, guanine, cytosine, thymine, uracil, inosine, or analogs thereof. Such barcodes may serve to identify a polynucleotide strand and/or distinguish it from other polynucleotide strands (e.g., those containing a different target sequence of interest), and may be used for various purposes, such as tracking, sorting, and/or identifying the samples, for example. Because different barcodes can be associated with different polynucleotide strands, such barcodes may be useful in multiplexed sequencing of different samples.

Multiplex Libraries

In various embodiments, there are provided sample discriminating codes or barcodes (e.g., nucleic acid barcodes) that may be attached to, or associated with, targets (e.g., nucleic acid fragments) to generate barcoded libraries (e.g., barcoded nucleic acid libraries). Such libraries may be prepared using one or more suitable nucleic acid or biomolecule manipulation procedures, including: fragmenting; size-selecting; end-repairing; tailing; adaptor-joining; nick translation; and purification, for example. In various embodiments, nucleic acid barcodes may be attached to, or associated with, fragments of a target nucleic acid sample using one or more suitable procedures, including ligation, cohesive-end hybridization, nick translation, primer extension, or amplification, for example. In some embodiments, nucleic acid barcodes may be attached to a target nucleic acid using amplification primers having a particular barcode sequence.

In various embodiments, a target nucleic acid or biomolecule (e.g., proteins, polysaccharides, and nucleic acids, and their polymer subunits, etc.) sample may be isolated from any suitable source, such as solid tissue, tissue, cells, yeast, bacteria, or similar sources, for example. Any suitable methods for isolating samples from such sources may be used. For example, solid tissue or tissue may be weighed, cut, mashed, and homogenized, and the sample may be isolated from the homogenized material. An isolated nucleic acid sample may be chromatin, which may be cross-linked with proteins that bind DNA, in a procedure known as ChIP (chromatin immunoprecipitation). In some embodiments, samples may be fragmented using any suitable procedure, including cleaving with an enzyme or chemical, or by shearing. Enzyme cleavage may include any type of restriction endonuclease, endonuclease, or transposase-mediated cleavage.

Fragment Libraries

In various embodiments, there are provided fragment libraries, which may comprise: a first priming site (P1); a second priming site (P2); an insert; an internal adaptor (IA); and a barcode (BC). In some embodiments, a fragment library may include constructs having certain arrangements, such as: a P1 priming site, an insert, an internal adaptor (IA), a barcode (BC), and a P2 priming site. In some embodiments, the fragment library may be attached to a solid support, such as a bead.

FIG. 11 illustrates an exemplary beaded template. It shows an exemplary nucleic acid attached to a solid support, such as a bead. A beaded template 700 includes a bead 710 having a linker 720, which is a sequence for attaching a template 730 to the solid support. The template 730 may include a first or P1 priming site 740, an insert 750, and a second or P2 priming site 760. The template 730 may be a synthetic template. The template 730 may be representative of a fragment library. The template 730 may comprise a nucleic acid barcode BC, which may be positioned between the P1 priming site 740 and the insert 750, for example. An internal adaptor may be placed between the P1 priming site 740 and the barcode BC, or between the barcode BC and the insert 750, or between the insert 750 and the P2 priming site 760.

FIG. 12 illustrates another exemplary beaded template. The nucleic acid barcode BC may be positioned between the insert 750 and the P2 priming site 760. An internal adaptor may be placed between the P1 priming site 740 and the insert 750, or between the insert 750 and the barcode BC, or between the barcode BC and the P2 priming site 760.

In various embodiments, the length of the linker 720 and template 730 may vary. For example, the length of the linker 720 may range from 10 to 100 bases, or from 15 to 45 bases, and may be 18 bases (18 b) in length. The template 730, which comprises the P1 priming site 740, the insert 750, and the P2 priming site 760, may also vary in length. For example, the P1 priming site 740 and the P2 priming site 760 may each range from 10 to 100 bases, or from 15 to 45 bases, and may be 23 bases (23 b) in length. The insert 750 may range from 2 bases (2 b) to 20,000 bases (20 kb), for example, and may be 60 bases (60 b), for example. In an embodiment, the insert 750 may comprise more than 100 bases, such as, e.g., 1,000 or more bases. In various embodiments, the insert may be in the form of a concatenate, in which case, the insert 750 may comprise up to 100,000 bases (100 kb) or more.

In various embodiments, the position of barcode BC may be selected based on various considerations, including the length of the insert, signal-to-noise ratio issues, and/or sequencing bias issues. For example, where signal-to-noise ratio may be an issue (e.g., the signal-to-noise ratio can decrease as additional ligation cycles are performed in sequencing-by-ligation, for example), the barcode BC may be positioned adjacent the P1 priming site 740 to avoid potential errors due to a diminished signal-to-noise ratio, and where the signal-to-noise ratio may not be a significant issue, the barcode BC may be placed adjacent to either the P1 priming site 740 or the P2 priming site 760. In some cases, template sequences may interact differently with a probe sequence used during the sequencing experiment. Placing the barcode BC before the insert 750 can affect the sequencing results for the insert 750. Positioning the barcode BC after the insert 750 can decrease sequencing errors due to bias. Generally, the position of the barcode can be affected by or affect sequencing, and the position that best achieves desired results based on the conditions of the sequencing process may be selected.

In various embodiments, sequencing and decoding of a nucleic acid barcode may be performed with a single forward direction sequence read (e.g., 5′-3′ direction along the template), e.g., reading the barcode BC and the insert 750 in a single read. In an embodiment, the forward read may be parsed into the barcode portion and the insert portion algorithmically.
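
As a minimal sketch of such algorithmic parsing, the following Python function splits a forward read into a barcode portion and an insert portion. It makes simplifying assumptions that are not stated in the embodiments: the read is in base space, the barcode sits at the very start of the read, matching is exact, and no barcode is a prefix of another. The function name, barcode sequences, and sample names are hypothetical.

```python
def parse_forward_read(read, barcodes):
    """Split a read into (sample, insert) by matching a leading barcode.

    `barcodes` maps barcode sequences to sample names (hypothetical values).
    Returns (None, read) when no known barcode starts the read.
    """
    for barcode, sample in barcodes.items():
        if read.startswith(barcode):
            return sample, read[len(barcode):]
    return None, read


if __name__ == "__main__":
    barcodes = {"TCGGAG": "sample_1", "CTAAGG": "sample_2"}
    print(parse_forward_read("CTAAGGACGTTT", barcodes))
    print(parse_forward_read("GGGGGGACGTTT", barcodes))
```

An error-tolerant variant would replace the exact prefix match with a nearest-codeword comparison in flowspace, at the cost of having to decide where the barcode flows end and the insert flows begin.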

In various embodiments, sample discriminating codes or barcodes may be attached to polymers, such as proteins, and may be a polypeptide attached to a protein. Intein-mediated ligation may join together separate proteins or polypeptides. For example, expressed protein ligation (EPL) involves a native chemical ligation (NCL) reaction between an intein-fusion protein and a protein having an N-terminal cysteine (N-Cys). In another example, protein trans-splicing involves reconstitution of two halves of an intein protein (see Dawson et al., “Synthesis of proteins by native chemical ligation,” Science, 266:776-779 (1994); Muir, “Semisynthesis of Proteins by Expressed Protein Ligation,” Ann. Rev. Biochem., 72:249-289 (2003); Paulus, “Protein Splicing and Related Forms of Protein Autoprocessing,” Ann. Rev. Biochem., 69:447-496 (2000); and Muralidharan et al., “Protein ligation: an enabling technology for the biophysical analysis of proteins,” Nature Methods, 3:429-438 (2006)).

Mate Pair Libraries

In various embodiments, nucleic acid barcodes as described herein may also be used in templates derived from a mate pair library. FIG. 13 illustrates an exemplary mate pair beaded template. It shows a beaded template 300 comprising a bead 310, a linker 320, and a template 330. The template 330 may comprise a first or P1 priming site 340 and second or P2 priming site 360, each of which may range in length from 10 to 100 bases, for example, or from 15 to 45 bases, for example, and may be 23 bases in length, for example. The template 330 may further comprise an insert 350, which may comprise a first tag sequence 352, a second tag sequence 354, and an internal adapter 356 located between the first and second tag sequences 352 and 354. In various embodiments, the barcode BC may be placed between the second tag sequence 354 and the P2 priming site 360. Other positions are possible. The first and second tag sequences 352 and 354 may each have a length ranging from 2 bases (2 b) to 20,000 bases (20 kb), for example, and may have a length of 60 bases, for example; they may be the same sequence or different sequences; and they may comprise a different number of bases or the same number of bases. The internal adapter 356, which may be common to all of the template sequences, may have a length ranging from 10 to 100 bases, for example, or from 15 to 45 bases, for example, and may have a length of 36 bases, for example.

In various embodiments, the barcode BC may be positioned based on the conditions of the sequencing process. For example, the barcode BC may be positioned between the P1 priming site 340 and a first tag sequence 352, as shown in FIG. 13, or between a second tag sequence 354 and the P2 priming site 360. Alternatively, the barcode BC may be positioned adjacent an internal adapter 356 and either a first tag sequence 352 or a second tag sequence 354. In another embodiment, the barcode BC may be integrated within an internal adapter 356.

In various embodiments, nucleic acid barcodes as described herein may be incorporated into an extended oligonucleotide comprising the barcode and one or more of a P1 primer, a P2 primer, and an internal adapter. For example, the barcode may be incorporated into an oligonucleotide comprising the P2 primer, the barcode, and an internal adapter, which may allow the barcode to be sequenced in a separate read. In various embodiments, the barcode may be incorporated into other oligonucleotides or arrangements of oligonucleotides.

In various embodiments, nucleic acid barcodes as described herein may be added to libraries using any suitable method. For example, full-length double-stranded oligonucleotide pairs specific for each barcode may be annealed and ligated onto double-stranded nucleic acid fragments. In another example, one full-length double-stranded oligonucleotide may be annealed to one short universal oligonucleotide specific for each barcode and ligated onto double-stranded nucleic acid fragments. In a further example, a universal oligonucleotide adapter may be ligated onto single-stranded RNA, converted into double-stranded DNA, and the barcode may be added using a barcode-specific PCR primer during library amplification.

In various embodiments, nucleic acid barcodes as described herein may be adapted for use in generating mate pair libraries for nucleic acid sequencing. For example, the barcodes may be used in SOLiD® Mate-Paired Library Construction Kits by Applied Biosystems (now Life Technologies Corp.). In various embodiments, the P2 adaptor may be replaced with a multiplex adaptor having three portions: an internal primer binding sequence; a barcode sequence; and a P2 primer binding sequence.

In an embodiment, the first tag sequence 352 may be a first sheared DNA tag sequence and the second tag sequence 354 may be a second sheared DNA tag sequence. Because the internal adaptor sequence may be located in between the two tag sequences 352 and 354, an alternative sequence may be used to prime the sequencing of the barcode BC.

According to an embodiment, there is provided a method for constructing a mate pair library using one or more barcodes as described herein positioned adjacent a P2 primer, comprising the following steps (which may be performed in addition to other routine library creation steps known to those ordinarily skilled in the art): (1) generating DNA fragments by shearing a DNA sample and repairing the ends; (2) ligating LMP CAP adaptors to the ends of the fragmented DNA; (3) circularizing the DNA with an internal adaptor which leaves nicks; (4) conducting a nick translation reaction to move the position of the nicks to a new position that is within the DNA fragment (the nick translation reaction may be stopped at a chosen time to place the nick at any desired position along the DNA fragment); (5) digesting the nick translated DNA with T7 exonuclease and S1 nuclease to release the linear, double-stranded mate pair tags; and (6) ligating multiplex P1 and P2 barcoded adaptors to the mate pair tags.

In an embodiment, such a mate pair library may be amplified, quantitated by qPCR or other method, and/or pooled. In various embodiments, beads may be templated with the mate pair library by emulsion PCR, and the resulting beads may be sequenced. In such a library, the P1 and IA ends of the insert sequences and the barcode may be sequenced in three separate reads from the same strand. The barcode may be sequenced using barcode adaptor sequences having P2, barcode, and priming sequences. Any suitable priming sequences may be used.

Paired End Libraries

In various embodiments, nucleic acid barcodes as described herein may be adapted for use in generating paired end libraries. Generally, paired end libraries may be constructed by: fragmenting a starting source of DNA (e.g., shearing); and attaching P1 adaptors and barcoded P2 adaptors to the ends of the fragments. The paired end library may be amplified and sequenced. In an embodiment, the paired ends and the barcodes in the paired end library may be sequenced in separate reads from the same strand.

SAGE™ Libraries

In various embodiments, nucleic acid barcodes as described herein may be adapted to construct a nucleic acid library for use in gene expression analysis using nucleic acid sequencing. For example, the nucleic acid barcodes may be used in SOLiD® SAGE™ gene expression analysis (where SAGE™ is Serial Analysis of Gene Expression, developed by Applied Biosystems (now Life Technologies Corp.)).

In various embodiments, nucleic acid barcodes as described herein may be designed to lack one or more restriction enzyme recognition sequence(s), amplification sequences, or adaptor sequences that are used for constructing the nucleic acid library. For example, in SAGE™, a recognition site for the restriction enzyme EcoP15I is used to generate SAGE™ tags. Nucleic acid barcodes used in SAGE™, other gene expression analysis, or other analyses reliant on recognition sites for restriction enzymes, for example, may be designed to avoid recognition sites necessary for the further analysis carried out in those processes.
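
A candidate barcode set can be screened for such recognition sites with a simple filter, sketched below in Python. The recognition sequences used here (CAGCAG for EcoP15I and CATG for NlaIII) are the commonly cited sites and are stated as assumptions rather than values taken from the embodiments; the function names and candidate barcodes are hypothetical. The sketch checks both strands by also searching for the reverse complement of each site.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of an uppercase DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def contains_site(barcode, site):
    """True if the site occurs on either strand of the barcode."""
    return site in barcode or reverse_complement(site) in barcode


def filter_barcodes(barcodes, forbidden_sites):
    """Keep only barcodes that contain none of the forbidden sites."""
    return [
        barcode
        for barcode in barcodes
        if not any(contains_site(barcode, site) for site in forbidden_sites)
    ]


if __name__ == "__main__":
    # Assumed recognition sequences: EcoP15I (CAGCAG) and NlaIII (CATG).
    forbidden = ["CAGCAG", "CATG"]
    candidates = ["TCGGAG", "ACAGCAGT", "TTCTGCTGA", "GCATGC"]
    print(filter_barcodes(candidates, forbidden))
```

This sketch screens each barcode in isolation; a site can also arise across the junction between a barcode and its flanking adaptor or insert, so a fuller check would scan the assembled construct.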

In various embodiments, SAGE™-compatible nucleic acid barcodes may be designed to be positioned adjacent the P1 primer. SAGE™ tags have a 2-base overhang resulting from EcoP15I cleavage. To account for the overhang, the nucleic acid barcode may comprise an overhang end of 1, 2, 3, 4, 5, or more bases. The overhang end may include a degenerate sequence. The nucleic acid barcode may include a 2-nucleotide degenerate extension to ligate to the SAGE™ tag. Alternatively, the 2-base overhang on the SAGE™ tag may be degraded or filled-in to produce a blunt end for ligating to the nucleic acid barcode.
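
To make the degenerate extension concrete: a 2-nucleotide degenerate extension (NN) corresponds to a mixture of 4 × 4 = 16 oligonucleotide variants, one for each possible dinucleotide. The Python sketch below enumerates them; the function name and the example barcode are hypothetical, and appending the extension to the 3′ end of the written sequence is an assumption made only for illustration.

```python
from itertools import product


def expand_degenerate(barcode, n_degenerate=2):
    """List every concrete sequence of a barcode followed by N...N."""
    return [
        barcode + "".join(extension)
        for extension in product("ACGT", repeat=n_degenerate)
    ]


if __name__ == "__main__":
    variants = expand_degenerate("TCGGAG")
    print(len(variants))
    print(variants[:4])
```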

FIG. 14A illustrates an exemplary barcoded adaptor. It shows a barcode BC attached to a P1 primer 440, wherein the barcode BC comprises a 2-nucleotide degenerate extension NN. A P2 primer may be adapted to ligate properly to the SAGE™ tag. The P2 primer may have an NlaIII overhang (GTAC) attached to an EcoP15I recognition site to ligate to the SAGE™ tag. FIG. 14B illustrates an exemplary beaded template. It shows a SAGE™ tag 450 ligated to barcode BC and the NlaIII overhang 462 and EcoP15I recognition site 464, which are ligated to P2 primer 460. P1 primer 440 is attached to solid support 410 (e.g., bead) through linker 420.

FIG. 15 illustrates another exemplary beaded template. It shows a barcode BC positioned adjacent a P2 primer 560. Primer P1 540 is attached to a solid support 510 (e.g., a bead) through linker 520. A 2-nucleotide degenerate overhang NN allows a SAGE™ tag 550 to ligate to the P1 primer 540. On the other side of the SAGE™ tag 550, an internal adapter IA is ligated to an EcoP15I recognition site 564 and an NlaIII overhang 562. The barcode may be incorporated in an oligonucleotide comprising one or more oligonucleotide sequences, such as, e.g., an internal adapter and a P2 primer, or comprising a modified internal adapter, the barcode, and a P2 primer. In various embodiments, the barcode need not be part of the library construct, but can be introduced by PCR amplification using a primer having the barcode sequence.

According to an embodiment, there is provided a method for generating barcoded SAGE™ libraries using one or more barcodes as described herein positioned adjacent the P2 primer, comprising the following steps (which may be performed in addition to other routine library creation steps known to those ordinarily skilled in the art): (1) generating an immobilized cDNA library from poly-A RNA; (2) digesting the cDNA with a restriction enzyme to create cohesive ends for EcoP15I ends (e.g., digesting with NlaIII); (3) ligating to the NlaIII cut ends an internal adaptor having cohesive ends for EcoP15I to form an EcoP15I recognition site; (4) cleaving the EcoP15I site to generate SAGE™ tag fragments; (5) ligating P1 adaptors (e.g., SAGE™-specific P1 adaptors having a 2-base degenerate extension to hybridize with the overhang from the cleaved EcoP15I ends); and (6) amplifying the library (e.g., PCR using primers having a P2 adaptor and barcode sequences). Any suitable primers may be used.

In various embodiments, such a library may be amplified, quantitated by qPCR or other method, and/or pooled. In various embodiments, beads may be templated with the SAGE™ library by emulsion PCR, and the templated beads may be sequenced.

Yeast Libraries

In various embodiments, nucleic acid barcodes as described herein may be used in combination with other yeast barcodes, including those described in Yan et al., “Yeast Barcoders: a chemogenomic application of a universal donor-strain collection carrying bar-code identifiers,” Nature Methods, 5(8):719-725 (2008), for example, which is incorporated by reference herein in its entirety. Yeast barcodes may be unique sequences identifying about 6,000 S. cerevisiae gene deletion strains, and may comprise a signature sequence of about 20 bases flanked by conserved PCR primer sequences. In an embodiment, a set of barcodes comprising about 100 uniquely identifiable barcodes as described herein may be used in combination with 6,000 other yeast barcodes to yield about 600,000 targets to be analyzed (e.g., per location on a slide when using a SOLiD® sequencing platform or using some other support, e.g., an Ion PGM™ or Ion Proton™ chip and sequencing platform, for example). For example, a SOLiD® slide would provide capacity for about 4.8 million targets, and using both slides in a SOLiD® apparatus would allow about 9.6 million targets to be analyzed simultaneously. Many more targets could be analyzed simultaneously using Ion PGM™ 314™, 316™, or 318™ Chips (which have about 1, 6, and 11 million wells, respectively), an Ion Proton™ Chip (which is expected to have about 165 million wells), and/or an Ion Proton™ II Chip (which is expected to have about 660 million wells). In an embodiment, a sequencing experiment may be performed in which chemical compounds are tested against each of the 6,000 S. cerevisiae gene deletion strains. Each chemical compound may be identified by a uniquely identifiable barcode as described herein, and each of the 6,000 S. cerevisiae gene deletion strains may be identified by a uniquely identifiable yeast barcode.
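
The target counts above follow from simple multiplication, as the short Python sketch below reproduces. All input figures are the approximate values stated in the preceding paragraph; the variable and function names are illustrative only.

```python
def combined_targets(n_sample_barcodes, n_yeast_barcodes):
    """Number of distinct (sample barcode, yeast barcode) combinations."""
    return n_sample_barcodes * n_yeast_barcodes


if __name__ == "__main__":
    # About 100 barcodes combined with about 6,000 yeast barcodes.
    print(combined_targets(100, 6_000))

    # About 4.8 million targets per slide, two slides per apparatus.
    targets_per_slide = 4_800_000
    slides_per_run = 2
    print(targets_per_slide * slides_per_run)
```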

In various embodiments, nucleic acid barcodes as described herein may be combined with at least one yeast barcode to prepare a module to be analyzed. The module may comprise a first conserved PCR primer adjacent a P1 primer. The nucleic acid barcode may be ligated to a P2 primer between the P2 primer and a second conserved PCR primer. An internal adapter may be positioned between the nucleic acid barcode and the second conserved PCR primer. In an embodiment, the complete nucleic acid sequence may comprise a P1 primer, a first conserved PCR primer, an insert with a yeast barcode, a second conserved PCR primer, an internal adapter, a nucleic acid barcode as described herein, and a P2 primer. Any suitable primers may be used.

ChIP-Seq Libraries

In various embodiments, nucleic acid barcodes as described herein may be adapted for use in chromatin immunoprecipitation (ChIP) sequencing to generate ChIP-based libraries. ChIP involves isolating genomic nucleic acids that are associated with DNA-binding proteins. Chromatin/protein complexes may be isolated using a SOLiD® ChIP-Seq Kit from Applied Biosystems (now part of Life Technologies Corp.). Isolated chromatin/protein complexes can be manipulated and ligated to nucleic acid barcodes and related adaptors to construct a ChIP-based library. General steps for ChIP include: (1) treating live cells or tissue with formaldehyde to crosslink proximal molecules to create protein/DNA complexes; (2) lysing the cells to release the cross-linked complexes; (3) fragmenting the DNA (e.g., via sonication); (4) immunoprecipitating the protein/DNA complex of interest using certain antibodies conjugated to beads; (5) releasing the DNA from the cross-linked complex by heat treatment; and (6) purifying the released DNA. General steps for preparing a ChIP-based library include: (1) generating cohesive ends on the ChIP-isolated DNA (e.g., end-repair); and (2) attaching P1, P2, and/or coded adaptors (e.g., with nucleic acid barcodes) to the ends of the ChIP-isolated DNA. Nick translation can be performed on the adaptor-ligated DNA to close any gaps or nicks between the DNA fragment and the adaptors. The ChIP-based library may include fragments of chromatin ligated at the ends with any combination of P1, P2, and/or coded adaptors (e.g., with nucleic acid barcodes).

According to various embodiments, one or more features of any one or more of the above-discussed teachings and/or embodiments may be performed or implemented using appropriately configured and/or programmed hardware and/or software elements. Determining whether an embodiment is implemented using hardware and/or software elements may be based on any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds, etc., and other design or performance constraints.

Examples of hardware elements may include processors, microprocessors, input(s) and/or output(s) (I/O) device(s) (or peripherals) that are communicatively coupled via a local interface circuit, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate arrays (FPGA), logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. The local interface may include, for example, one or more buses or other wired or wireless connections, controllers, buffers (caches), drivers, repeaters and receivers, etc., to allow appropriate communications between hardware components. A processor is a hardware device for executing software, particularly software stored in memory. The processor can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer, a semiconductor based microprocessor (e.g., in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions. A processor can also represent a distributed processing architecture. The I/O devices can include input devices, for example, a keyboard, a mouse, a scanner, a microphone, a touch screen, an interface for various medical devices and/or laboratory instruments, a bar code reader, a stylus, a laser reader, a radio-frequency device reader, etc. Furthermore, the I/O devices also can include output devices, for example, a printer, a bar code printer, a display, etc. Finally, the I/O devices further can include devices that communicate as both inputs and outputs, for example, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, etc.

Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Software in memory may include one or more separate programs, which may include ordered listings of executable instructions for implementing logical functions. The software in memory may include a system for identifying data streams in accordance with the present teachings and any suitable custom made or commercially available operating system (O/S), which may control the execution of other computer programs such as the system, and may provide scheduling, input-output control, file and data management, memory management, communication control, etc.

According to various embodiments, one or more features of any one or more of the above-discussed teachings and/or embodiments may be performed or implemented using appropriately configured and/or programmed non-transitory machine-readable medium or article that may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, scientific or laboratory instrument, etc., and may be implemented using any suitable combination of hardware and/or software. The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writable or re-writable media, digital or analog media, hard disk, floppy disk, read-only memory compact disc (CD-ROM), recordable compact disc (CD-R), rewritable compact disc (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disc (DVD), a tape, a cassette, etc., including any medium suitable for use in a computer. Memory can include any one or a combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, EPROM, EEROM, Flash memory, hard drive, tape, CDROM, etc.). Moreover, memory can incorporate electronic, magnetic, optical, and/or other types of storage media. Memory can have a distributed architecture where various components are situated remote from one another, but are still accessed by the processor. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, etc., implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.

According to various embodiments, one or more features of any one or more of the above-discussed teachings and/or embodiments may be performed or implemented at least partly using a distributed, clustered, remote, or cloud computing resource.

According to various embodiments, one or more features of any one or more of the above-discussed teachings and/or embodiments may be performed or implemented using a source program, executable program (object code), script, or any other entity comprising a set of instructions to be performed. When a source program, the program can be translated via a compiler, assembler, interpreter, etc., which may or may not be included within the memory, so as to operate properly in connection with the O/S. The instructions may be written using (a) an object oriented programming language, which has classes of data and methods, or (b) a procedural programming language, which has routines, subroutines, and/or functions, which may include, for example, C, C++, Pascal, Basic, Fortran, Cobol, Perl, Java, and Ada.

According to various embodiments, one or more of the above-discussed embodiments may include transmitting, displaying, storing, printing or outputting to a user interface device, a computer readable storage medium, a local computer system or a remote computer system, information related to any information, signal, data, and/or intermediate or final results that may have been generated, accessed, or used by such embodiments. Such transmitted, displayed, stored, printed or outputted information can take the form of searchable and/or filterable lists of runs and reports, pictures, tables, charts, graphs, spreadsheets, correlations, sequences, and combinations thereof, for example.

Various other embodiments may be derived by repeating, adding, or substituting any generically or specifically described features and/or components and/or substances and/or steps and/or operating conditions set forth in one or more of the above-described embodiments. Further, it should be understood that an order of steps or order for performing certain actions is immaterial so long as the objective of the steps or action remains achievable, unless specifically stated otherwise. Furthermore, two or more steps or actions can be conducted simultaneously so long as the objective of the steps or action remains achievable, unless specifically stated otherwise. Moreover, any one or more feature, component, aspect, step, or other characteristic mentioned in one of the above-discussed embodiments may be considered to be a potential optional feature, component, aspect, step, or other characteristic of any other of the above-discussed embodiments so long as the objective of such any other of the above-discussed embodiments remains achievable, unless specifically stated otherwise.

Although various embodiments of the present teachings may advantageously be used with sequencing-by-synthesis approaches, as described herein and in Rothberg et al., U.S. Pat. Publ. No. 2009/0026082; Anderson et al., SENSORS AND ACTUATORS B CHEM., 129:79-86 (2008); Pourmand et al., PROC. NAT'L. ACAD. SCI., 103:6466-6470 (2006), which are all incorporated by reference herein in their entirety, for example, the present teachings may also be used with other approaches, such as variants of sequencing-by-synthesis including methods where the nucleotides or nucleoside triphosphate precursors are modified to be reversible terminators (sometimes referred to as cyclic reversible termination (CRT) methods) and methods where the nucleotides or nucleoside triphosphate precursors are unmodified (sometimes referred to as cyclic single base delivery (CSD) methods), for example, or more generally methods that comprise repeated steps of delivering (or extending in response to delivering) nucleotides (to the polymerase-primer-template complex) and collecting signals (or detecting the incorporation either directly or indirectly).

Although various embodiments of the present teachings may advantageously be used in connection with pH-based sequence detection, as described herein and in Rothberg et al., U.S. Pat. Appl. Publ. Nos. 2009/0127589 and 2009/0026082 and Rothberg et al., U.K. Pat. Appl. Publ. No. GB2461127, which are all incorporated by reference herein in their entirety, for example, the present teachings may also be used with other detection approaches, including the detection of pyrophosphate (PPi) released by the incorporation reaction (see, e.g., U.S. Pat. Nos. 6,210,891; 6,258,568; and 6,828,100); various fluorescence-based sequencing instrumentation (see, e.g., U.S. Pat. Nos. 7,211,390; 7,244,559; and 7,264,929); some sequencing-by-synthesis techniques that can detect labels associated with the nucleotides, such as mass tags, fluorescent, and/or chemiluminescent labels (in which case an inactivation step may be included in the workflow (e.g., by chemical cleavage or photobleaching) prior to the next cycle of synthesis and detection); and more generally methods where an incorporation reaction generates or results in a product or constituent with a property capable of being monitored and used to detect the incorporation event, including, for example, changes in magnitude (e.g., heat) or concentration (e.g., pyrophosphate and/or hydrogen ions), and signal (e.g., fluorescence, chemiluminescence, light generation), in which cases the amount of the detected product or constituent may be monotonically related to the number of incorporation events, for example.
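By way of non-limiting illustration, where the detected amount is monotonically related to the number of incorporation events, a per-flow signal may be resolved into an incorporation count by comparing it with a reference 1-mer level. The following Python sketch assumes already normalized signals, a simple nearest-integer rule, and a cap at a 2-mer so that the result fits a three character alphabet; the function name `call_homopolymers` and its parameters `one_mer_level` and `max_call` are hypothetical and are not taken from the present teachings.

```python
def call_homopolymers(signals, one_mer_level=1.0, max_call=2):
    """Resolve per-flow signals into incorporation counts (0-mer, 1-mer, 2-mer).

    Assumption: each signal scales roughly linearly with the number of
    incorporations, with `one_mer_level` being the signal of a single
    incorporation. Real base calling may use a more elaborate model.
    """
    calls = []
    for s in signals:
        # Round the normalized signal to the nearest whole number (half rounds up).
        n = int(s / one_mer_level + 0.5)
        # Clamp into the alphabet {0, ..., max_call}.
        calls.append(max(0, min(max_call, n)))
    return calls


if __name__ == "__main__":
    # Hypothetical normalized signals for five consecutive flows.
    print(call_homopolymers([0.05, 1.1, 1.9, 0.4, 2.7]))  # [0, 1, 2, 0, 2]
```

The cap at `max_call` is a design choice of this sketch only; it keeps every call inside a ternary alphabet at the cost of collapsing longer homopolymers onto the 2-mer symbol.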

The following documents are all incorporated by reference herein in their entirety: Li et al., International Publ. No. WO/2012/044847, published Apr. 5, 2012; Chen et al., U.S. patent application Ser. No. 13/482,542, filed May 29, 2012; Fu et al., “Counting individual DNA molecules by the stochastic attachment of diverse labels,” PNAS 108(22):9026-9031 (May 31, 2011); and Shiroguchi et al., “Digital RNA sequencing minimizes sequence-dependent bias and amplification noise with optimized single-molecule barcodes,” PNAS 109(4):1347-1352 (Jan. 24, 2012).

Although the present description describes in detail certain embodiments, other embodiments are also possible and within the scope of the present invention. For example, those skilled in the art may appreciate from the present description that the present teachings may be implemented in a variety of forms, and that the various embodiments may be implemented alone or in combination. Variations and modifications will be apparent to those skilled in the art from consideration of the specification and figures and practice of the teachings described in the specification and figures, and the claims.

What is claimed is:
1. A method for sequencing a polynucleotide sample having a barcode sequence, comprising: introducing a series of nucleotides to the polynucleotide sample according to a predetermined flow ordering; obtaining a series of signals resulting from the introducing of nucleotides to the polynucleotide sample; and resolving the series of signals over the barcode sequence to render a flowspace string, wherein the flowspace string is a codeword of an error-tolerant code capable of distinguishing the barcode sequence from other barcode sequences in the presence of one or more errors.
2. The method of claim 1, wherein the error-tolerant code is an error-correcting code.
3. The method of claim 2, wherein the error-correcting code is a ternary code using a three character alphabet.
4. The method of claim 3, wherein a first character of the alphabet represents a 0-mer signal, a second character in the alphabet represents a 1-mer signal, and a third character in the alphabet represents a 2-mer signal.
5. The method of claim 3, wherein the ternary code is a ternary Hamming code.
6. The method of claim 3, wherein the ternary code is a ternary Golay code.
7. The method of claim 1, wherein the codeword has a length of 15 or fewer characters.
8. The method of claim 2, wherein the error-correcting code is capable of correcting up to two errors in the flowspace string.
9. The method of claim 2, wherein the error-correcting code is capable of correcting one error in the flowspace string.
10. The method of claim 1, wherein the barcode sequence has a length in the range of 5-15 bases.
11. The method of claim 1, further comprising determining whether the flowspace string contains an error.
12. The method of claim 11, further comprising correcting the flowspace string if an error is detected.
13. The method of claim 12, further comprising translating the flowspace string into a base sequence to obtain the sequence of the barcode sequence.
14. The method of claim 1, further comprising: decoding the flowspace string to the correct flowspace codeword; and translating the flowspace codeword into a base sequence to obtain the sequence of the barcode sequence.
15. The method of claim 2, further comprising providing multiple different polynucleotide samples for multiplex sequencing by the series of flowed nucleotides, each different polynucleotide sample comprising a different barcode sequence that gives a different flowspace string, each different flowspace string being a different codeword of the error-correcting code.
16. The method of claim 15, wherein any two flowspace codewords of the error-correcting code have a Hamming distance of at least 3.
17. The method of claim 15, wherein any two flowspace codewords of the error-correcting code have a Hamming distance of at least 5.
18. The method of claim 1, further comprising one or both of: (i) detecting hydrogen ions released by the incorporation of nucleotides into the polynucleotide sample, wherein the amplitude of the signals is related to the amount of hydrogen ions detected, and (ii) detecting inorganic pyrophosphate released by the incorporation of nucleotides into the polynucleotide sample, wherein the amplitude of the signals is related to the amount of inorganic pyrophosphate detected.
19. A system, including: a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method for sequencing a polynucleotide sample having a barcode sequence, comprising: introducing a series of nucleotides to the polynucleotide sample according to a predetermined flow ordering; obtaining a series of signals resulting from the introducing of nucleotides to the polynucleotide sample; and resolving the series of signals over the barcode sequence to render a flowspace string, wherein the flowspace string is a codeword of an error-tolerant code capable of distinguishing the barcode sequence from other barcode sequences in the presence of one or more errors.
20. A set of barcodes for nucleic acid sequencing, comprising: a plurality of nucleic acid base sequences, each nucleic acid base sequence being a base space representation of a flowspace representation of a codeword of an error-tolerant sample discrimination code.
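By way of non-limiting illustration only, the relationship between a base space barcode, its flowspace string, and error-tolerant decoding may be sketched in Python as follows. The sketch assumes a cyclic flow ordering "TACG", a toy three-codeword set with pairwise Hamming distance 3, and nearest-codeword decoding with a fixed error budget; all function names, the flow ordering, and the codewords are hypothetical examples and are not the codes or barcodes of the present teachings.

```python
def base_to_flowspace(bases, flow_order, n_flows):
    """Return the per-flow incorporation counts for `bases` under a cyclic flow order."""
    flows, pos = [], 0
    for i in range(n_flows):
        nuc = flow_order[i % len(flow_order)]
        count = 0
        # A flow incorporates every consecutive base matching the flowed nucleotide.
        while pos < len(bases) and bases[pos] == nuc:
            count += 1
            pos += 1
        flows.append(count)
    return tuple(flows)


def flowspace_to_base(flows, flow_order):
    """Translate a flowspace string back into a base sequence."""
    return "".join(flow_order[i % len(flow_order)] * n for i, n in enumerate(flows))


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed, codewords, max_errors=1):
    """Return the nearest codeword, or None if it is more than `max_errors` away."""
    best = min(codewords, key=lambda c: hamming_distance(observed, c))
    return best if hamming_distance(observed, best) <= max_errors else None


if __name__ == "__main__":
    flow_order = "TACG"                                    # assumed flow ordering
    codewords = [(0, 0, 0, 0), (1, 1, 1, 0), (2, 2, 2, 0)]  # toy code, distance 3
    observed = (1, 1, 2, 0)                                # one flow miscalled
    corrected = decode(observed, codewords)
    print(corrected)                                       # (1, 1, 1, 0)
    print(flowspace_to_base(corrected, flow_order))        # TAC
```

Because any two codewords in the toy set differ in at least three positions, a single miscalled flow still leaves the observed string closer to the intended codeword than to any other, which is the property recited for a Hamming distance of at least 3.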